Research On Data Mining Based Prediction Of Glycosylation In The Human Proteome

Posted on: 2017-02-06
Degree: Master
Type: Thesis
Country: China
Candidate: F Y Li
Full Text: PDF
GTID: 2180330485980616
Subject: Software engineering
Abstract/Summary:
Glycosylation is a ubiquitous type of protein post-translational modification (PTM) in eukaryotic cells and plays vital roles in various biological processes. It is estimated that more than 50% of the entire human proteome is glycosylated. However, identifying glycosylation sites remains a significant challenge, because it requires expensive and laborious experimental work. Bioinformatics approaches that can predict glycan occupancy at specific sequons in protein sequences or structures would therefore be useful for understanding and utilizing this important PTM. In this study, we propose three novel approaches, called GlycoMine, GlycoMineStruct and PAnDE, for glycosylation site prediction. The main contents of this dissertation are as follows:

(1) GlycoMine, an approach for predicting glycosylation sites from protein sequences. First, data were collected and preprocessed. Second, a variety of features were extracted, including sequence features, predicted structural features, protein functional features, and functional annotations. Extensive feature selection was then performed using a two-step procedure in which the optimal feature subset was selected for each glycosylation type. In the final stage, three random forest (RF) based classifiers were trained, one each for C-, N- and O-linked glycosylation sites. Five-fold cross-validation and independent tests demonstrated that GlycoMine outperformed other existing glycosylation site prediction tools. Furthermore, two case studies demonstrated that GlycoMine can be applied rapidly to accurately identify potential novel glycosylation sites in a protein of interest, and that it increases AUC by about 10% on average. In addition, we developed a web server for GlycoMine to allow users to perform bioinformatics analyses.

(2) GlycoMineStruct, an approach for predicting glycosylation sites from protein structures. First, the glycosylated protein sequences used for GlycoMine were mapped to the PDB database and preprocessed, and a variety of structural and sequential features were then extracted and calculated. A two-step feature selection procedure was applied to identify the most informative feature subsets for N- and O-linked glycosylation site prediction. GlycoMineStruct, an RF-based classifier, was trained on the final selected feature subsets. Comparison experiments demonstrated that GlycoMineStruct outperformed NGlycPred, with the AUC increasing by 14.5%. On the independent test dataset, GlycoMineStruct reached AUC values of 0.941 and 0.922 for N- and O-linked glycosylation site prediction, respectively. We also developed a web server for GlycoMineStruct to allow users to perform bioinformatics analyses.

(3) PAnDE, a positive-unlabeled learning algorithm. Glycosylation datasets can be regarded as positive-unlabeled data, and GlycoMine and GlycoMineStruct are limited in how they select negative samples, because it is difficult to identify sites that are truly non-glycosylated. We therefore propose an algorithm termed PAnDE, which further relaxes the attribute independence assumption of the PNB and PAODE algorithms. We performed empirical studies comparing PAnDE with PNB and PAODE on sequential and structural glycosylation datasets as well as on 20 UCI datasets. The results demonstrate that PAnDE outperforms both PNB and PAODE, and they highlight its predictive power and its scalability for glycosylation site prediction.
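The positive-unlabeled setting described above can be made concrete with a small sketch. It is not the PAnDE algorithm from this thesis. It only illustrates the general idea behind naive-Bayes-style positive-unlabeled estimators: when no confirmed negatives exist, the distribution of an attribute among negatives can be recovered from the mixture identity P(x) = p * P(x | positive) + (1 - p) * P(x | negative). The sketch assumes that the unlabeled set is a representative sample of all sites and that the positive prior p is supplied by the user. The function name, parameters and toy data are hypothetical.

```python
from collections import Counter


def estimate_negative_conditional(positive_values, unlabeled_values,
                                  positive_prior, smoothing=1.0):
    """Estimate P(x | negative) for one categorical attribute.

    Assumptions (not taken from the thesis): the unlabeled sample
    approximates the overall distribution P(x), and positive_prior
    is the known or guessed fraction of positives in the population.
    """
    if not 0.0 <= positive_prior < 1.0:
        raise ValueError("positive_prior must be in [0, 1)")

    categories = sorted(set(positive_values) | set(unlabeled_values))
    pos_counts = Counter(positive_values)
    unl_counts = Counter(unlabeled_values)
    k = len(categories)
    pos_total = len(positive_values) + smoothing * k
    unl_total = len(unlabeled_values) + smoothing * k

    estimates = {}
    for value in categories:
        # Additively smoothed frequency estimates from each sample.
        p_x_given_pos = (pos_counts[value] + smoothing) / pos_total
        p_x = (unl_counts[value] + smoothing) / unl_total
        # Solve the mixture identity for the negative-class term.
        p_x_given_neg = (p_x - positive_prior * p_x_given_pos) / (1.0 - positive_prior)
        # Sampling noise can push the estimate below zero, so clip it.
        estimates[value] = max(p_x_given_neg, 1e-9)

    # Renormalize so the clipped estimates form a distribution.
    norm = sum(estimates.values())
    return {value: p / norm for value, p in estimates.items()}


# Toy data: a hypothetical attribute with two possible values.
positives = ["S"] * 8 + ["T"] * 2
unlabeled = ["S"] * 50 + ["T"] * 50
neg = estimate_negative_conditional(positives, unlabeled,
                                    positive_prior=0.2, smoothing=0.0)
print(neg)
```

On this toy data the estimate comes out at approximately 0.425 for "S" and 0.575 for "T". A naive Bayes classifier would multiply such per-attribute estimates under an independence assumption; according to the abstract, PAnDE relaxes that assumption further than PNB and PAODE do.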
Keywords/Search Tags: glycosylation site prediction, feature selection, positive unlabeled learning, GlycoMine, GlycoMineStruct, PAnDE