
Analysis Of Conservative Motif Features Of The Protein Superfamilies And The Theoretical Prediction For Protein Superfamilies

Posted on: 2010-08-01
Degree: Master
Type: Thesis
Country: China
Candidate: S J Ma
GTID: 2120360278467575
Subject: Theoretical Physics
Abstract/Summary:
Conserved motifs reflect the evolutionary relationships within protein superfamilies and usually play an important role in protein function. Identifying the superfamily of a protein is therefore increasingly important for the study of protein function. In this dissertation, biostatistical and biomathematical methods are used to analyze the characteristics of motifs in protein superfamilies. The work covers the analysis of motif characteristics, the distribution of conserved motifs, feature extraction, and the identification of protein superfamilies. The study is arranged as follows. First, a new protein superfamily database containing 16 protein superfamilies was established. The protein sequences were extracted from the Structural Classification of Proteins (SCOP) database. ScanProsite and MEME, two readily available tools for finding protein motifs, were both used to select the sequence motifs. The position and function information of the motifs was analyzed. By comparing their functional features, distribution and frequency, some important regularities in the position distribution were obtained. These regularities are very important for distinguishing different protein superfamilies. Second, one-factor analysis of variance (one-way ANOVA) was used to test the amino acid compositions, the physicochemical characteristics and the hybrid features, and the statistically significant characteristics were extracted. This method not only reduces the dimension of the feature vector effectively, but also provides a new kind of parameter for the recognition of protein superfamilies. In addition, the frequencies of motifs with known function and of motifs with statistical significance were analyzed. The motif frequency defined in this dissertation was used for the first time as a new feature for protein superfamily recognition.
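The ANOVA-based feature extraction described above can be illustrated with a minimal sketch. It assumes each protein is represented by a numeric feature vector (for example, amino acid composition) and that a feature is kept when its one-way ANOVA F statistic across superfamilies reaches a cutoff. The function names, the data layout and the cutoff value are assumptions for illustration and are not taken from the dissertation.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of values."""
    k = len(groups)                      # number of groups (superfamilies)
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_significant_features(vectors_by_family, f_cutoff):
    """Return indices of features whose F statistic is at least f_cutoff.

    vectors_by_family maps a superfamily label to a list of equal-length
    feature vectors. f_cutoff is a hypothetical threshold; in practice it
    would be the critical F value for the chosen significance level.
    """
    families = list(vectors_by_family.values())
    n_features = len(families[0][0])
    selected = []
    for i in range(n_features):
        groups = [[vec[i] for vec in family] for family in families]
        if one_way_anova_f(groups) >= f_cutoff:
            selected.append(i)
    return selected
```

Features whose values differ little between superfamilies receive a small F statistic and are dropped, which is how such a test reduces the dimension of the feature vector.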
Finally, three parameter selection methods were applied to predict the protein superfamily. First, the amino acid composition, dipeptide composition, hydrophilicity and hydrophobicity, physicochemical and hybrid parameter models were selected as the input parameters for the minimum increment of diversity algorithm. Second, new parameter models built from the statistically significant features were applied to predict the protein superfamily. Third, a new hybrid model combining motif frequency with the statistically significant parameter models was used to identify the protein superfamily. The results indicate that the best prediction accuracy is obtained by combining multiple kinds of parameters. Predictions based on the extracted features are better than those based on the sequence features, and the extracted features effectively reduce the dimension of the feature vector. The overall prediction accuracy is 10% higher than that of the other two models. Using the 400+M parameter model, the overall accuracies in jackknife tests are 83.5%, 87.1%, 84.3% and 83.1% for the superfamilies of the all-α, all-β, α/β and α+β protein structural classes, respectively.
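The minimum increment of diversity prediction and the jackknife test mentioned above can be sketched as follows. The sketch uses the standard definitions: the measure of diversity of a count vector is D(X) = N ln N - Σ n_i ln n_i, and the increment of diversity is ID(X, Y) = D(X + Y) - D(X) - D(Y). A query is assigned to the class with the smallest increment. The use of raw count vectors, the function names and the way class profiles are pooled are assumptions for illustration, not the dissertation's actual parameter models.

```python
import math


def diversity(counts):
    """Measure of diversity D(X) = N ln N - sum(n_i ln n_i) of a count vector."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(c * math.log(c) for c in counts if c > 0)


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y); small values mean similar profiles."""
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)


def predict(query_counts, class_profiles):
    """Return the label whose pooled profile gives the minimum increment."""
    return min(
        class_profiles,
        key=lambda label: increment_of_diversity(query_counts, class_profiles[label]),
    )


def jackknife_accuracy(samples):
    """Leave-one-out accuracy for a list of (label, count_vector) samples.

    Each sample is removed in turn, the class profiles are rebuilt from the
    remaining samples by summing their count vectors, and the removed sample
    is then predicted.
    """
    correct = 0
    for i, (label, counts) in enumerate(samples):
        profiles = {}
        for j, (other_label, other_counts) in enumerate(samples):
            if j == i:
                continue
            pooled = profiles.get(other_label, [0] * len(other_counts))
            profiles[other_label] = [a + b for a, b in zip(pooled, other_counts)]
        if predict(counts, profiles) == label:
            correct += 1
    return correct / len(samples)
```

Because the increment is zero when the query and the class profile have identical proportions and grows as they diverge, choosing the minimum acts as a nearest-profile rule. The jackknife loop never lets a sample contribute to the profile it is tested against.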
Keywords/Search Tags: protein superfamily, motif characteristics, conservative motif distribution, motif frequency, one-way ANOVA, minimum increment of diversity