Research On Data Mining Based Prediction Of Glycosylation In The Human Proteome

Posted on:2017-02-06

Degree:Master

Type:Thesis

Country:China

Candidate:F Y Li

Full Text:PDF

GTID:2180330485980616

Subject:Software engineering

Abstract/Summary:

Glycosylation is a ubiquitous type of protein post-translational modification (PTM) in eukaryotic cells and plays vital roles in various biological processes. It is estimated that more than 50% of the human proteome is glycosylated. However, identifying glycosylation sites remains a significant challenge because it requires expensive and laborious experimental work. Bioinformatics approaches that can predict glycan occupancy at specific sequons in protein sequences or structures would therefore be useful for understanding and exploiting this important PTM. In this study, we propose three novel approaches, called GlycoMine, GlycoMinestruct and PAnDE, for glycosylation site prediction. The main contents of this dissertation are as follows:

(1) GlycoMine, an approach for predicting glycosylation sites from protein sequences. Data were first collected and preprocessed. A variety of features were then extracted, including sequence features, predicted structural features, protein functional features, and functional annotations. Extensive feature selection followed, using a two-step procedure that selected the optimal feature subset for each glycosylation type. In the final stage, three random forest (RF) based classifiers were trained, one each for C-, N- and O-linked glycosylation sites. Five-fold cross-validation and independent tests demonstrated that GlycoMine outperformed other existing glycosylation site prediction tools. Furthermore, two case studies demonstrated that GlycoMine can be applied rapidly to accurately identify potential novel glycosylation sites in a protein of interest, and that GlycoMine improves AUC by about 10% on average. We also developed a web server for GlycoMine to allow users to perform bioinformatics analyses.

(2) GlycoMinestruct, an approach for predicting glycosylation sites from protein structures. First, the glycosylated protein sequences used for GlycoMine were mapped to the PDB database and preprocessed, and a variety of structural and sequence features were extracted and calculated. A two-step feature selection procedure was then applied to identify the most informative feature subsets for N- and O-linked glycosylation site prediction. GlycoMinestruct, an RF-based classifier, was trained on the final selected feature subsets. Comparison experiments demonstrated that GlycoMinestruct outperformed NGlycPred, with an AUC increase of 14.5%. On the independent test dataset, GlycoMinestruct achieved AUC values of 0.941 and 0.922 for N- and O-linked glycosylation site prediction, respectively. We also developed a web server for GlycoMinestruct to allow users to perform bioinformatics analyses.

(3) PAnDE, a positive-unlabeled (PU) learning algorithm. Glycosylation datasets can be regarded as positive-unlabeled data: only positive sites are confirmed, and truly non-glycosylated samples are difficult to select. GlycoMine and GlycoMinestruct are therefore limited in how they choose negative samples. To address this, we proposed an algorithm termed PAnDE, which further relaxes the attribute independence assumption of the PNB and PAODE algorithms. We performed empirical studies comparing PAnDE with PNB and PAODE on sequence-based and structure-based glycosylation datasets as well as 20 UCI datasets. The results demonstrate that PAnDE outperformed both PNB and PAODE, and they highlight the predictive power of PAnDE and its scalability to glycosylation site prediction.
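
The positive-unlabeled setting described for PAnDE can be illustrated with a small sketch. The code below is not PAnDE and is not taken from the thesis; it is a minimal naive-Bayes-style PU estimator in the spirit of PNB, under two loudly stated assumptions: features are binary, and the positive class prior (`prior_pos`) is known in advance. It relies on the standard mixture identity P(x | unlabeled) = p · P(x | positive) + (1 − p) · P(x | negative), which lets the negative-class statistics be inferred without any labeled negatives. The function names `fit_pu_naive_bayes` and `posterior_pos` are hypothetical.

```python
import math


def fit_pu_naive_bayes(positives, unlabeled, prior_pos, alpha=1.0):
    """Estimate per-feature Bernoulli parameters for both classes.

    positives, unlabeled: lists of equal-length 0/1 feature vectors.
    prior_pos: assumed P(y = 1); an input to this sketch, not estimated here.
    alpha: Laplace smoothing constant.
    """
    n_feat = len(positives[0])

    def rate(rows, j):
        # Laplace-smoothed frequency of feature j being 1.
        return (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)

    theta_pos, theta_neg = [], []
    for j in range(n_feat):
        p_pos = rate(positives, j)
        p_unl = rate(unlabeled, j)
        # Unlabeled data is a mixture of both classes:
        #   p_unl = prior_pos * p_pos + (1 - prior_pos) * p_neg
        p_neg = (p_unl - prior_pos * p_pos) / (1.0 - prior_pos)
        # The plug-in estimate can fall outside (0, 1), so clip it.
        p_neg = min(max(p_neg, 1e-6), 1.0 - 1e-6)
        theta_pos.append(p_pos)
        theta_neg.append(p_neg)
    return theta_pos, theta_neg


def posterior_pos(x, theta_pos, theta_neg, prior_pos):
    """Return P(y = 1 | x) under the naive independence assumption."""
    log_pos = math.log(prior_pos)
    log_neg = math.log(1.0 - prior_pos)
    for xj, tp, tn in zip(x, theta_pos, theta_neg):
        log_pos += math.log(tp if xj else 1.0 - tp)
        log_neg += math.log(tn if xj else 1.0 - tn)
    # Normalize in a numerically stable way.
    m = max(log_pos, log_neg)
    e_pos = math.exp(log_pos - m)
    e_neg = math.exp(log_neg - m)
    return e_pos / (e_pos + e_neg)


if __name__ == "__main__":
    # Toy data (made up for illustration): feature 0 is typical of positives.
    pos = [[1, 0], [1, 0], [1, 1], [1, 0]]
    unl = [[1, 0], [0, 1], [0, 1], [0, 0], [1, 0],
           [0, 1], [0, 1], [0, 1], [1, 1], [0, 1]]
    tp, tn = fit_pu_naive_bayes(pos, unl, prior_pos=0.3)
    print(posterior_pos([1, 0], tp, tn, 0.3))
    print(posterior_pos([0, 1], tp, tn, 0.3))
```

This sketch treats every attribute as independent given the class, which is the assumption the abstract says PAODE and PAnDE progressively relax. In practice the class prior is rarely known and would itself have to be estimated.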

Keywords/Search Tags:

glycosylation site prediction, feature selection, positive unlabeled learning, GlycoMine, GlycoMineStruct, PAnDE

Related items

1	Semi-supervised Prediction Of Protein Interaction Site From Unlabeled Sample Information
2	Research On Prediction Of Glycosylation By Deep Neural Network
3	Research On Prediction Algorithm Of Protein Phosphorylation Sites Based On Granular Computing
4	Research On MiRNA-disease Association Prediction Algorithm From Multi-source Heterogeneous Information
5	The Sign Prediction Models Based On Transfer Learning In Unlabeled Complex Networks
6	A Machine Learning Model For Runoff Prediction Based On Feature Selection And Joint Time-Frequency Analysis
7	Credit Default Prediction Based On Sequence Backward Feature Selection And Grouping Equalization Undersampling
8	Study On MiRNA With Positive And Unlabeled Learning Strategy And Matrix Completion
9	Research And Application Of Feature Modeling Algorithm Based On Age Prediction
10	Prediction Of Non-coding RNA Based On Feature Selection And Integration Algorithms