
Research On Prediction Of Protein Domains Based On Support Vector Machines

Posted on: 2010-09-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S X Zou
GTID: 1100360272995632
Subject: Computer application technology
Abstract/Summary:
Because proteins carry some of the most fundamental information about processes in almost all organisms, predicting protein structure and function has become one of the most important goals of bioinformatics research. Protein domains are among the most useful avenues for understanding protein function and for domain family-based analysis, and they are of great importance in the study of individual proteins. Detecting the domain structure of a protein is a challenging problem: for a given protein, one must determine which amino acids lie inside a domain and which lie on a domain boundary. The problem takes two forms. In the first, the domains and boundaries are to be located in a known protein structure. In the second, they are to be located in a sequence whose structure is unknown. The second form is the more difficult.

Support vector machines (SVMs) are a statistical learning technique that can be seen as a new method for training classifiers based on polynomial functions, radial basis functions, neural networks, splines, or other functions. An SVM uses a linear separating hyperplane to build a classifier. For problems that cannot be linearly separated in the input space, it offers a solution through a non-linear transformation of the original input space into a high-dimensional feature space, where an optimal separating hyperplane can be found. Although SVMs have been studied extensively and have shown remarkable success in many applications, their performance drops significantly on imbalanced datasets. Moreover, this drop is difficult to avoid when one tries to improve the performance of SVMs on imbalanced datasets by modifying the algorithm alone.
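The idea of separating data linearly after a non-linear transformation can be illustrated with a small sketch. Everything here is an assumption made for illustration: the feature map `feature_map`, the toy samples, and the hand-picked hyperplane `w`, `b` are hypothetical and are not taken from the thesis, and no SVM is actually trained.

```python
# Illustrative only: a hand-picked feature map and hyperplane, not a trained SVM.

def feature_map(x):
    """Map a 1-D input to the 2-D feature space (x, x^2)."""
    return (x, x * x)

# Toy data: positives sit near the origin and negatives lie on both sides,
# so no single threshold on x separates the two classes in the input space.
samples = [(-3.0, -1), (-2.5, -1), (-0.5, 1), (0.0, 1), (0.8, 1), (2.2, -1), (3.1, -1)]

# Hypothetical hyperplane w . phi(x) + b = 0 in the feature space.
# It reduces to x^2 = 2.5, which lies between the two classes.
w = (0.0, -1.0)
b = 2.5

def predict(x):
    """Classify x by the sign of the linear score in the feature space."""
    phi = feature_map(x)
    score = w[0] * phi[0] + w[1] * phi[1] + b
    return 1 if score > 0 else -1

predictions = [predict(x) for x, _ in samples]
labels = [y for _, y in samples]
print(predictions == labels)  # prints True
```

A real SVM would learn the hyperplane from data and would usually reach the feature space implicitly through a kernel function instead of an explicit map.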
Sampling, applied as a preprocessing step to the data, is therefore a popular strategy for handling the class imbalance problem, since it rebalances the dataset directly.

This thesis presents an intensive study of domain boundary detection using only a given protein sequence.

A promising method for detecting the domain structure of a protein from sequence information alone is presented. Given a query sequence, the algorithm starts by searching the protein sequence database and generating a multiple alignment of all significant hits. The columns of the multiple alignment are analyzed using a variety of sources to define scores that reflect the domain-information content of each column, such as conservation measures on the composition and classification of amino acids in the column, consistency and correlation measures, and measures of structural flexibility. Information theory-based principles are employed to maximize the information content. In addition, we draw on an existing method for predicting domain boundaries from protein sequence alone. That method is based on the theory that a protein's unique three-dimensional structure results from the balance between the gain of attractive native interactions and the loss of conformational entropy. These scores are then combined using a support vector machine to label single columns as core-domain or boundary positions. The overall accuracy of the method on a dataset of single protein chains is about 85%.

A novel undersampling method using distance-based maximal entropy in the feature space of SVMs is proposed. The SVM's unique learning mechanism makes it an interesting candidate for dealing with imbalanced datasets, since it takes into account only the data close to the boundary, i.e. the support vectors, when building its model. More importantly, because SVMs are kernel-based methods, their classification is defined in the feature space, and so is our undersampling preprocessing.
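Distances in the feature space can be computed from kernel values alone, using the standard identity ||phi(x) - phi(y)||^2 = k(x, x) - 2 k(x, y) + k(y, y). The sketch below shows only that identity. The kernel choices, the `gamma` value, and the sample points are assumptions for illustration, and the abstract does not specify the exact distance or entropy measure used in the thesis.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an assumed illustrative value."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def linear_kernel(x, y):
    """Plain dot product, for which the feature space equals the input space."""
    return sum(a * b for a, b in zip(x, y))

def feature_space_distance(x, y, kernel):
    """Return ||phi(x) - phi(y)|| computed from kernel values only."""
    sq = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors

# Hypothetical positive sample and three negatives at increasing input distance.
positive = (0.0, 0.0)
negatives = [(0.1, 0.0), (1.0, 1.0), (4.0, 3.0)]
distances = [feature_space_distance(positive, n, rbf_kernel) for n in negatives]
print(distances)
```

With the RBF kernel, k(x, x) = 1 for every x, so the feature-space distance is bounded above by sqrt(2) however far apart the inputs are. This is one reason a distance threshold chosen in the input space does not carry over directly to the feature space.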
Accordingly, negatives that are very close to or very distant from a given positive are not sampled. Negatives too close to the learned hyperplane may skew it, while those far away from it cannot become support vectors and contribute nothing to training. Negatives whose distance is close to the mean distance, by contrast, contribute the most. The negatives that have the maximal entropy value with their counterpart positives are the ones selected by the undersampling, so that the input data are no longer imbalanced. The learned hyperplane thus lies further from the positive class, which compensates for the skew associated with imbalanced datasets that pushes the hyperplane closer to the positive class.

Given a query sequence, the algorithm starts by searching the local sequence database and generating a multiple alignment of all significant hits. The columns of the multiple alignment are analyzed using a variety of sources to define scores that reflect the domain-information content of the alignment columns. Information theory-based principles are employed to maximize the information content. In addition, a feature is extracted from the conformational entropy of the protein sequence. The result is an imbalanced training dataset. Next, we resample the dataset to form the N individuals of the initial population of a genetic algorithm. We test two sampling techniques separately: oversampling of the minority class and undersampling of the majority class. An SVM is trained on each resampled training set, and the corresponding AUC value is computed. The population is updated by three basic genetic operators (reproduction, crossover, and mutation) according to the AUC fitness value. SVM training and population updating are iterated until convergence or until the maximum number of iterations is reached.
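The AUC value used as the fitness of each resampled training set can be computed directly from classifier scores. The following is a minimal sketch of the pairwise (Mann-Whitney) definition of AUC. The function name `auc` and the example scores are assumptions for illustration, and the thesis may compute AUC differently.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Labels equal to 1 are positives;
    every other label is treated as negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical decision scores for five samples, two of them positive.
example_scores = [0.9, 0.8, 0.3, 0.2, 0.1]
example_labels = [1, 0, 1, 0, 0]
print(auc(example_scores, example_labels))
```

In the example, five of the six positive-negative pairs are ranked correctly, so the value is 5/6. Unlike plain accuracy, this measure does not reward a classifier for labelling everything as the majority class, which makes it a reasonable fitness value on imbalanced data.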
A fuzzy classification system model based on support vector machines is also proposed. As a powerful tool for dealing with complex uncertainty problems, fuzzy system theories (L.A. Zadeh et al.) have succeeded in many applications such as signal processing and pattern recognition. However, they often suffer from the curse of dimensionality on high-dimensional data. SVMs and fuzzy systems are complementary in such cases. Earlier work proved the equivalence between SVMs and positive definite fuzzy classifiers, which makes it possible to combine SVMs with fuzzy systems. Reduction methods are developed to minimize the complexity of the system by reducing the linguistic terms in the fuzzy rules based on the similarity of fuzzy sets and by removing redundant and inconsistent fuzzy rules. Finally, particle swarm optimization is used to adjust the system parameters to compensate for the deviation caused by the reduction. Experimental results show that the methods are feasible and effective.
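The parameter adjustment step can be sketched with a basic global-best particle swarm optimizer. This is a generic textbook variant, not the thesis's implementation: the function `pso_minimize`, its default coefficients, and the stand-in objective `toy_error` are all assumptions for illustration. In the thesis the objective would be an error measure of the reduced fuzzy system.

```python
import random

def pso_minimize(objective, dim, bounds, n_particles=20, n_iters=200,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize objective over a box with global-best PSO (assumed settings)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]               # personal best positions
    best_val = [objective(p) for p in pos]       # personal best values
    g = min(range(n_particles), key=lambda i: best_val[i])
    gbest_pos, gbest_val = best_pos[g][:], best_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Pull toward the personal best and the global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (gbest_pos[d] - pos[i][d]))
                # Move and keep the particle inside the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < gbest_val:
                    gbest_val, gbest_pos = val, pos[i][:]
    return gbest_pos, gbest_val

def toy_error(params):
    """Hypothetical stand-in for the error of a reduced fuzzy system."""
    return (params[0] - 1.0) ** 2 + (params[1] + 2.0) ** 2

best_params, best_error = pso_minimize(toy_error, dim=2, bounds=(-5.0, 5.0))
print(best_params, best_error)
```

The optimizer needs only objective values, not gradients, which suits tuning membership-function parameters whose effect on classification error is not easily differentiated.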
Keywords/Search Tags: Protein Sequence, SVM (Support Vector Machine), Protein Domain, Boundary, Imbalanced Data Learning, Fuzzy Systems, PSO (Particle Swarm Optimization), Maximum Entropy, Undersampling, Oversampling, ROC (Receiver Operating Characteristic)