Classification Of Protein By Using Support Vector Machine

Posted on: 2005-10-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S W Zhang
Full Text: PDF
GTID: 1100360155477372
Subject: Control theory and control engineering
Abstract/Summary:
With the success of the Human Genome Project, the number of protein sequences entering the data banks is increasing rapidly. The structures and functions of these proteins can be determined experimentally, but doing so for every sequence is very time-consuming and practically infeasible. Scientists have therefore been seeking theoretical and computational methods for predicting protein structure and function. This dissertation investigates several methods for classifying or predicting protein structures and functions from protein primary sequences. The main contributions are summarized as follows:

1. A new composite classification approach is proposed, in which the support vector machine (SVM) algorithm is combined with two feature extraction methods, amino acid composition and auto-correlation functions based on an amino acid index, to distinguish homodimers from non-homodimers using protein primary sequences. In the 10-fold cross-validation (10CV) test, the total classification accuracy of this method is 17.1 percentage points higher than that of Garian's earlier method.

2. Two new feature extraction methods are put forward, and two existing feature extraction methods are also introduced. These four methods are combined with SVM and two classification strategies to classify homodimers, homotrimers, homotetramers and homohexamers from protein primary sequences. The simulation results show that the three feature extraction methods that incorporate sequence-order information outperform the conventional amino acid composition method. Among them, the weighted auto-correlation function method proposed here performs best: its total accuracy is 6.39 and 2.41 percentage points higher than that of the amino acid composition method and Chou's feature extraction method, respectively. The 'one-versus-one' strategy is superior to the 'one-versus-rest' strategy, with a total accuracy 17.69 percentage points higher.

3. A new composite classification method is proposed, in which the auto-correlation function feature extraction method is combined with SVM and an 'improved unique one-versus-rest' strategy to classify 27 protein fold classes. In the independent test, the total classification accuracy of the auto-correlation function method is about 7 percentage points higher than that of amino acid composition. The 'improved unique one-versus-rest' strategy is superior to the 'one-versus-rest' strategy: its total accuracies in the independent test and the 5-fold cross-validation (5CV) test are about 18 and 12 percentage points higher, respectively.

4. A weighting scheme is introduced to form a new feature extraction method, the weighted auto-correlation function method, for representing protein sequences. Two classification strategies ('one-versus-rest' and 'one-versus-one') are used with it to classify membrane proteins and to predict protein subcellular locations. The results are significantly improved:
1) For membrane proteins, the total accuracy of the new feature extraction method is 87.98% in the jackknife test, which is 3.38 percentage points higher than that of amino acid composition with the same 'one-versus-rest' strategy and SVM. With the 'one-versus-one' strategy, the total accuracy reaches 94.88% in the jackknife test, which is 6.9 percentage points higher than that of the 'one-versus-rest' strategy.
2) For protein subcellular location, the total predictive accuracies for prokaryotic and eukaryotic subcellular location are 92.38% and 95.22%, respectively, in the jackknife test; the eukaryotic accuracy is far higher than Hua's result of 79.4%. For eukaryotic proteins in the jackknife test, the total predictive accuracy of the 'one-versus-one' strategy is 12.19 percentage points higher than that of the 'one-versus-rest' strategy, and the total predictive accuracy of the new feature extraction method is 2.96 percentage points higher than that of the amino acid composition method.

5. Finally, the kernel functions and their parameters are briefly discussed.
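The two baseline feature extraction methods named in contributions 1 and 2, amino acid composition and the auto-correlation function over an amino acid index, might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the formula, so the commonly used form r_n = (1/(L-n)) * sum of h_i * h_(i+n) is assumed, the Kyte-Doolittle hydropathy scale stands in for whichever amino acid index the dissertation uses, and the function names and the `max_lag` parameter are hypothetical. The weighting used by the weighted auto-correlation function method is not specified in the abstract and is not reproduced here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: Kyte-Doolittle hydropathy values serve as the example amino acid index.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def amino_acid_composition(sequence):
    """Return the frequencies of the 20 standard residues (they sum to 1)."""
    counts = Counter(sequence)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[a] / total for a in AMINO_ACIDS]


def auto_correlation(sequence, index=HYDROPATHY, max_lag=5):
    """Return [r_1, ..., r_max_lag], where r_n is the mean of h_i * h_(i+n)."""
    # Map residues to index values, skipping non-standard symbols.
    h = [index[a] for a in sequence if a in index]
    if len(h) <= max_lag:
        raise ValueError("sequence is too short for the requested max_lag")
    features = []
    for lag in range(1, max_lag + 1):
        products = [h[i] * h[i + lag] for i in range(len(h) - lag)]
        features.append(sum(products) / len(products))
    return features


def feature_vector(sequence, max_lag=5):
    """Concatenate composition (20 values) and auto-correlation (max_lag values)."""
    return amino_acid_composition(sequence) + auto_correlation(sequence, max_lag=max_lag)
```

Unlike composition alone, the auto-correlation terms depend on the order of residues along the chain, which is the sequence-order information the abstract credits for the accuracy gains.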
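The 'one-versus-rest' and 'one-versus-one' strategies compared in contributions 2 to 4 differ in how binary classifiers are combined into a multi-class decision. The sketch below shows the standard form of each rule; the function names are hypothetical, the binary decision values are assumed to come from already trained SVMs, and the dissertation's 'improved unique one-versus-rest' variant is not described in the abstract and is not covered.

```python
from collections import Counter
from itertools import combinations


def count_binary_classifiers(n_classes):
    """Number of binary classifiers each strategy has to train."""
    return {
        "one-versus-rest": n_classes,
        "one-versus-one": n_classes * (n_classes - 1) // 2,
    }


def predict_one_versus_rest(scores):
    """Pick the class whose class-versus-rest classifier gives the largest decision value.

    scores maps each class label to that decision value.
    """
    return max(scores, key=scores.get)


def predict_one_versus_one(classes, pairwise_winner):
    """Pick the class that wins the most pairwise contests.

    pairwise_winner(a, b) returns a or b, the output of the classifier
    trained on classes a and b only. Ties go to the class listed first.
    """
    votes = Counter(pairwise_winner(a, b) for a, b in combinations(classes, 2))
    return max(classes, key=lambda c: votes[c])
```

For the four oligomer classes in contribution 2, `count_binary_classifiers(4)` gives 4 classifiers for 'one-versus-rest' and 6 for 'one-versus-one'; for the 27 fold classes in contribution 3 the counts are 27 and 351.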
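The jackknife test used for the membrane protein and subcellular location results is a leave-one-out evaluation: each protein is held out in turn, the classifier is trained on all the others, and the held-out protein is then predicted. A minimal sketch follows, assuming generic `train` and `predict` callables; to stay self-contained it uses a 1-nearest-neighbour classifier as a stand-in, not the SVM used in the dissertation, and all names are hypothetical.

```python
def jackknife_accuracy(samples, labels, train, predict):
    """Leave each sample out once, train on the rest, and test on the held-out one."""
    if len(samples) != len(labels) or len(samples) < 2:
        raise ValueError("need at least two samples and one label per sample")
    correct = 0
    for i in range(len(samples)):
        # Training set is everything except sample i.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def train_nearest_neighbour(train_x, train_y):
    """Stand-in 'training': a 1-nearest-neighbour model just stores the data."""
    return list(zip(train_x, train_y))


def predict_nearest_neighbour(model, x):
    """Return the label of the stored sample closest to x (squared Euclidean distance)."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(model, key=lambda pair: squared_distance(pair[0], x))[1]
```

A k-fold test such as 10CV or 5CV follows the same pattern but holds out one of k groups of samples at a time instead of a single sample.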
Keywords/Search Tags: Support Vector Machine (SVM), feature extraction, classifying strategy, weighted auto-correlation function, homodimer, homotrimer, homotetramer, homohexamer, folds, membrane protein, subcellular location