
The Study On Classification And Prediction For High Dimensionality Biological Data

Posted on: 2010-05-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Wang
GTID: 1100360302466670
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
In recent years, the volume of biological data has been growing at an ever faster pace owing to the rapid development of biological techniques. For example, the emergence of microarray technology has greatly accelerated biological discovery and produces gene expression data with thousands of features. How to analyse such high-dimensional data effectively is attracting increasing attention from researchers. Similarly, large protein datasets are represented by a large number of high-dimensional feature vectors. The earliest methods proposed to describe protein sequences were simple, such as the amino acid composition method. The features extracted this way carry only a small amount of sequence information, and the number of features is low. As research deepened, various physical and chemical properties of the amino acids were incorporated into the representation; the pseudo amino acid composition method is one example. More recently, as protein databases have been established and completed, new feature representations have been proposed that use database information and evolutionary information. Typical methods include dipeptide composition, PSSM, functional domain composition, and GO. As the representation methods have improved, the number of extracted features has grown from dozens to hundreds or even thousands. Redundancy and correlation in the data increase as the number of features rises, which brings problems such as longer computing time and greater classifier complexity. To address this, researchers have begun developing methods that reduce data redundancy and computational complexity. Feature selection, referred to here as dimension reduction, is one effective way to solve this problem.
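As an illustration of the two simplest sequence encodings mentioned above, the following is a minimal sketch, not the thesis's own implementation. It assumes the 20 standard one-letter residue codes and simply skips any other character; the function names `amino_acid_composition` and `dipeptide_composition` are hypothetical. It shows why the feature count jumps from 20 to 400 when moving from single residues to ordered residue pairs.

```python
from itertools import product

# Assumption: only the 20 standard amino acids are counted.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Return a 20-dimensional vector of residue frequencies."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    total = len(residues)
    return [residues.count(a) / total for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return a 400-dimensional vector of adjacent-pair frequencies."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Keep only pairs made of two standard residues.
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not pairs:
        raise ValueError("sequence contains no standard dipeptides")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for p in pairs:
        counts[p] += 1
    total = len(pairs)
    return [counts[d] / total for d in DIPEPTIDES]


if __name__ == "__main__":
    toy = "ACDA"  # toy sequence, not real protein data
    print(len(amino_acid_composition(toy)), len(dipeptide_composition(toy)))
```

Richer encodings such as PSSM or GO-based features need external databases and are not sketched here.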
Dimension reduction removes redundancy from the data while keeping the most important information in the original data. Many experimental results show that adopting dimension reduction simplifies the prediction system and improves classification performance. In this thesis, linear subspace dimension reduction is adopted to predict the classification of protein data, and its validity is verified experimentally. The shortcoming of linear dimension reduction methods, however, is that they cannot reveal nonlinear structure in a dataset, and many real datasets are essentially nonlinear. The biological datasets used in this thesis, for example, have a complicated nonlinear structure. To remedy this deficiency and reveal the inherent nonlinear structure of the data, the linear subspace prediction methods are generalized to a high-dimensional feature space, and a protein classification and prediction method based on kernel methods is developed. Kernel functions, however, are not directly interpretable. Manifold learning algorithms remedy this deficiency. A new dimension reduction method, called the maximum variance projection (MVP) method, is proposed that combines the advantages of manifold learning and linear dimension reduction. Experimental results show that this method achieves higher accuracy in protein prediction. To address a shortcoming of the Isomap algorithm, an improved algorithm, MDM-Isomap (Minimax Distance Metric-Isomap), is also proposed, and its validity is verified by a face recognition experiment.
The main contributions of this thesis are as follows:
1. This thesis applies the linear subspace dimension reduction method to the prediction of protein subcellular localization and quaternary structure.
First, features are extracted from protein sequences by a sequence encoding method. The dimension of the extracted feature vectors is generally very high, which can cause the "curse of dimensionality". The direct negative effect is that predicting the subcellular localization of proteins of Gram-negative bacteria becomes more complicated. We then apply the linear subspace dimension reduction method to extract the important lower-dimensional feature vectors, and identify potential novel protein classes from these reduced feature vectors. Finally, experimental results show that the prediction accuracy obtained with linear dimension reduction is higher than that obtained without it, and that the prediction system is simplified at the same time.
2. Linear dimension reduction applies only to data with linear structure. This assumption is too restrictive, because many real datasets, biological data among them, are essentially nonlinear. To cope with this problem, kernel dimension reduction methods are proposed that extend the linear subspace method to a higher-dimensional feature space. The kernel dimension reduction method has been successfully applied to protein subcellular localization, and the prediction accuracy obtained with this nonlinear method is higher than that obtained with linear subspace dimension reduction.
3. By examining the advantages of manifold learning methods and linear DR algorithms, a way of combining the two is found. The maximum variance projection (MVP) method, an algorithm that fuses linear and nonlinear methods, is proposed to predict the types of membrane proteins. The idea of MVP is to preserve local information by capturing both the between-class and the within-class geometric properties of the feature space.
Compared with the traditional LDA algorithm, the advantage of MVP is that it takes the geometric structure of the sample space into account. Compared with basic manifold learning methods, it has discriminative ability and can therefore solve classification problems, in particular the classification of membrane proteins.
4. After comparing the results obtained by various dimensionality reduction algorithms on protein dataset prediction, we summarize the advantages and disadvantages of these manifold learning algorithms. To address a shortcoming of the Isomap algorithm, we propose an improved algorithm, MDM-Isomap. With the new Minimax Distance Metric, the essential characteristics of the manifold can be reflected by choosing appropriate nearest neighbours. The validity of the algorithm is verified by a face recognition experiment, in which MDM-Isomap achieves higher prediction accuracy than the original Isomap method.
5. To speed up the translation and application of the research results, a web server for protein subcellular localization prediction has been constructed. Scholars from all parts of the world can use the service over the Internet.
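The abstract does not define the Minimax Distance Metric used in contribution 4. The sketch below therefore assumes the commonly used minimax path distance: the distance between two points is the smallest achievable "largest hop" over all paths that connect them through other sample points. It is an illustration under that assumption, not the MDM-Isomap algorithm itself; the function names are hypothetical, and the computation uses a Floyd-Warshall style update over Euclidean distances.

```python
import math


def euclidean(p, q):
    """Plain Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def minimax_distances(points):
    """All-pairs minimax path distances (assumed definition, see lead-in).

    d[i][j] becomes the minimum, over all paths from i to j through the
    sample points, of the longest single hop on that path.
    """
    n = len(points)
    d = [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Going through k costs the larger of the two legs.
                via_k = max(d[i][k], d[k][j])
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d


if __name__ == "__main__":
    # Toy points on a line: three close together and one far away.
    pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
    d = minimax_distances(pts)
    print(d[0][2])  # 1.0: reachable through hops of length 1
    print(d[0][3])  # 8.0: the gap between 2 and 10 cannot be avoided
```

Under this metric, points connected through a chain of close samples count as near each other, which is one plausible way for neighbour selection to follow the manifold rather than the straight-line distance.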
Keywords/Search Tags: Pattern recognition, High dimensionality biological data, Linear dimensionality reduction, Non-linear dimensionality reduction, Manifold learning, Protein sequence, Classification and prediction, Feature extraction, Subcellular localization