Multi-sites Sequence Properties Based IDPs Predictions

Posted on: 2016-09-17
Degree: Master
Type: Thesis
Country: China
Candidate: E S Wu
Full Text: PDF
GTID: 2180330470980938
Subject: Microbiology
Abstract/Summary:
Intrinsically disordered proteins (IDPs) are naturally occurring proteins that lack a stable three-dimensional structure, and they are closely associated with many human diseases. Because of their important biological roles, IDPs have become a hot spot in protein research. Because they lack a well-defined three-dimensional structure, IDPs are difficult to characterize experimentally, so computational analysis of sequence characteristics and sequence-based prediction have become an effective way to study them. The work presented here has two parts. First, the sequence features of the ordered and disordered regions of IDPs are mined in depth to identify sequence parameters that effectively distinguish the two kinds of region. Second, multi-site information is integrated into the sequence features to develop a prediction algorithm that distinguishes ordered regions from disordered regions, providing a new method for predicting IDPs.

1. Mining sequence information of intrinsically disordered proteins

Based on a larger dataset derived from the latest version of the DisProt database, a set of 749 ordered regions and 387 disordered regions, each longer than 30 amino acids, was established. Sequence analysis shows that the sequence complexity of ordered sequences is generally higher than that of disordered sequences, indicating that disordered regions have more pronounced amino acid preferences. Further analysis showed that sequence complexity is independent of sequence length.
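The abstract does not say which sequence complexity measure was used. As a minimal sketch, assuming Shannon entropy of the residue distribution (a common choice, but an assumption here), complexity could be computed as follows. The function name and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def shannon_complexity(seq: str) -> float:
    """Shannon entropy (bits per residue) of the residue distribution in seq.

    ASSUMPTION: the thesis abstract does not name its complexity measure;
    Shannon entropy is used here purely for illustration.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical toy sequences, not taken from DisProt.
low = shannon_complexity("SSSSEEEESSSSEEEE")        # two residue types, equal frequency
high = shannon_complexity("ACDEFGHIKLMNPQRSTVWY")   # all 20 residue types, once each
print(f"{low:.3f} {high:.3f}")
```

A sequence dominated by a few residue types scores low, which matches the abstract's observation that lower complexity goes together with stronger amino acid preference.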
To reveal the amino acid preferences of ordered and disordered regions, a systematic sequence analysis was carried out on our dataset. It found that disordered regions prefer hydrophilic amino acids such as D, E, K, Q, S and T, whereas ordered regions prefer hydrophobic amino acids such as F, I, L, M, V, W and Y, so the two kinds of region have different sequence features. To further illustrate how the distributions of amino acids and dipeptides differ between ordered and disordered regions, we combined different amino acids with the Chaos Game Representation (CGR) analysis method to examine the sequence differences in depth. The results indicate significant differences between the disordered and ordered regions of IDPs. Visual CGR analysis showed that the CGR maps of disordered and ordered regions differ, and that disordered regions contain more sequences made up of repeated residues. These results provide a solid theoretical foundation for predicting IDPs.

2. Classification method based on the sequence features of ordered and disordered regions

Based on the different features of ordered and disordered sequences, the sequence complexity, the frequencies of the 20 amino acids and the frequencies of the 400 dipeptides were used as input parameters for the classification algorithms. Pseudo-amino acid composition (PseAAC) was also introduced, for the first time, as an input parameter to describe multi-site features, and was combined with a support vector machine (SVM) to develop a classification algorithm for ordered and disordered regions. The results show that PseAAC extracts the information in ordered and disordered regions more effectively. With PseAAC as the input, a better result was obtained: accuracy (ACC) 79.22%, sensitivity (Sn) 89.31%, specificity (Sp) 59.70%, Matthews correlation coefficient (MCC) 0.5211 and area under the ROC curve (AUC) 0.8467. In addition, we found that scaling the classification parameters can also improve the classification results.
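The feature vectors described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis implementation: the PseAAC variant follows the general form of Chou's definition but uses a single property (the Kyte-Doolittle hydropathy index) instead of several, and the values `lam=5` and `w=0.05` are illustrative defaults that the abstract does not give. All function names and the toy sequence are hypothetical, and sequences are assumed to contain only the 20 standard residues.

```python
from itertools import product
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index (ASSUMPTION: the abstract does not say
# which physicochemical properties were used for PseAAC).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

# Standardize the property to zero mean and unit variance over the 20 residues.
_vals = list(KD.values())
_mu, _sd = mean(_vals), pstdev(_vals)
H = {a: (v - _mu) / _sd for a, v in KD.items()}


def aa_composition(seq):
    """Frequencies of the 20 amino acids (20 features)."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Frequencies of the 400 overlapping dipeptides (400 features)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(a + b) / len(pairs)
            for a, b in product(AMINO_ACIDS, repeat=2)]


def pse_aac(seq, lam=5, w=0.05):
    """Simplified PseAAC: 20 composition terms plus lam correlation terms."""
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    freqs = aa_composition(seq)
    # theta_k: mean squared property difference between residues k sites apart.
    thetas = [mean((H[seq[i]] - H[seq[i + k]]) ** 2
                   for i in range(len(seq) - k))
              for k in range(1, lam + 1)]
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


toy = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical sequence
print(len(aa_composition(toy)), len(dipeptide_composition(toy)), len(pse_aac(toy)))
```

The correlation terms are what carry the multi-site information: each one summarizes how residues a fixed number of positions apart relate to each other, which plain amino acid or dipeptide frequencies cannot capture beyond adjacent sites.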
The effect of data scaling on the classification results is therefore worth further study.

In summary, this work studied the analysis and classification of ordered and disordered regions and revealed the different inherent characteristics of the two kinds of region. PseAAC was used to mine sequence characteristics from multi-site features as further evidence. This allows sequence associations to be studied more fully and provides a new approach for developing IDP prediction methods.
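The abstract does not specify how the classification parameters were scaled. One common option with SVMs, shown here only as an assumed example, is per-feature min-max scaling to the range [-1, 1], with the minima and maxima fitted on the training set and reused for any test set. The function names and the toy data are hypothetical.

```python
def fit_min_max(rows):
    """Return per-feature minima and maxima from the training rows."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]


def scale(rows, mins, maxs, lower=-1.0, upper=1.0):
    """Linearly map each feature to [lower, upper] using the fitted ranges.

    ASSUMPTION: min-max scaling to [-1, 1] is illustrative; the thesis
    abstract does not state which scaling scheme was used.
    """
    scaled = []
    for row in rows:
        out = []
        for x, lo, hi in zip(row, mins, maxs):
            if hi == lo:
                # Constant feature: place it at the middle of the target range.
                out.append((lower + upper) / 2)
            else:
                out.append(lower + (upper - lower) * (x - lo) / (hi - lo))
        scaled.append(out)
    return scaled


# Hypothetical toy feature matrix with two features on very different scales.
train = [[0.0, 10.0], [0.5, 20.0], [1.0, 30.0]]
mins, maxs = fit_min_max(train)
print(scale(train, mins, maxs))
```

Scaling of this kind keeps features with large numeric ranges from dominating the kernel computation, which is one plausible reason it could affect SVM results.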
Keywords/Search Tags:Intrinsically Disordered Proteins, Sequence analysis, Prediction, Pseudo-Amino Acid Composition, Support Vector Machine