Research On Feature Extraction Algorithm And Dataset Construction Technology In Membrane Protein Classification

Posted on: 2011-09-20
Degree: Master
Type: Thesis
Country: China
Candidate: C Ceng
GTID: 2120330338990046
Subject: Computer Science and Technology
Abstract/Summary:
Membrane proteins are one of the main components of biomembranes and play a vital role in organisms. They carry out most biomembrane functions and provide the material basis for cells to implement various functions. Moreover, recent research indicates that changes in the structure or function of some membranes are closely related to the development of human diseases, and the relevant receptor membrane proteins have become important targets for drug design. That is why this thesis focuses on membrane proteins.

The Human Genome Project (HGP), launched in the early 1990s, has achieved tremendous results through the joint efforts of scientists all over the world. Meanwhile, genomics and proteomics have developed greatly. With the unprecedented growth of biological data, bioinformatics, a new approach based on computer technology, is taking the place of traditional methods.

Predicting the types of membrane proteins from their primary sequences, in order to obtain information about their higher-order structure and function, is crucial fundamental research in the study of membrane protein structure and function. This important and challenging work will also provide clues for solving specific biological problems, which is our goal as well.

The construction of the membrane protein dataset is the foundation and premise of the whole prediction model. Its quality influences the accuracy of the algorithm, which makes it one of the dominant factors in membrane protein classification research. Feature extraction from membrane protein sequences is another basic technique in computational protein classification, and also a key factor in classification performance.
This thesis collects membrane protein sequences from the latest release of SWISS-PROT to build a newer, more comprehensive and more evenly distributed dataset, following the construction standards of the common datasets CE2059 and CE2625. Starting from the primary sequences of membrane proteins, this thesis studies the classification of membrane protein structures and functions and proposes a new feature extraction algorithm based on the new dataset. Further testing and analysis of the feature extraction algorithm are under way. The main work of this thesis is summarized as follows:

(1) Construction of the new dataset for membrane proteins. The construction of the dataset is one of the dominant factors in membrane protein classification research. The commonly used datasets in this field, CE2059 and CE2625, are largely based on SWISS-PROT Release 35 from 1997. As the databank develops, the number, scale and annotations of membrane protein sequences are updated regularly, which makes it both significant and necessary to construct a new dataset from the latest data. The thesis builds a larger and more evenly distributed dataset from the latest SWISS-PROT Release 57.0 (2009), following the construction criteria of the standard datasets, and thereby provides important and necessary preparation for further study.

(2) The feature extraction algorithm is another key process in this field. To obtain a classification model with better prediction accuracy and to further mine the structural and functional information in membrane protein sequences, this thesis additionally considers the physicochemical properties of amino acid residues and the long-distance correlation between them. It constructs a novel membrane protein classification model that combines two feature classes with the support vector machine (SVM) algorithm. The two feature classes are the amino acid composition (AAC) and several residue indices from the amino acid index database.
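The two feature classes described in item (2) can be illustrated with a minimal sketch. The abstract does not give the exact formulas or say which amino acid indices were selected, so everything specific here is an assumption: the Kyte-Doolittle hydropathy scale stands in for "several indexes of the residues", a normalized lag-k autocorrelation stands in for the long-distance correlation term, and the function names (`aac`, `lag_correlation`, `extract_features`) and the `max_lag` parameter are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here only as a stand-in index
# (assumption: the thesis does not say which indices it selected).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def aac(seq):
    """Amino acid composition: frequency of each of the 20 standard residues."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def lag_correlation(seq, index, max_lag):
    """Normalized autocorrelation of an index along the sequence, lags 1..max_lag."""
    values = [index[a] for a in seq if a in index]
    if not values:
        raise ValueError("sequence contains no residues covered by the index")
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    variance = sum(c * c for c in centered) / len(centered)
    features = []
    for lag in range(1, max_lag + 1):
        pairs = len(centered) - lag
        if pairs <= 0 or variance == 0:
            # Sequence too short for this lag, or the index is constant.
            features.append(0.0)
            continue
        cov = sum(centered[i] * centered[i + lag] for i in range(pairs)) / pairs
        features.append(cov / variance)
    return features


def extract_features(seq, max_lag=5):
    """Concatenate AAC (20 values) with max_lag correlation values."""
    seq = seq.upper()
    return aac(seq) + lag_correlation(seq, KYTE_DOOLITTLE, max_lag)
```

Calling `extract_features(sequence)` on a protein sequence string returns a fixed-length vector (25 values with the default `max_lag`) regardless of sequence length, which is what allows sequences of different lengths to be passed to a classifier such as an SVM.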
Under three typical tests (self-consistency, jackknife and independent dataset), the prediction accuracy on the new membrane protein dataset described above is 96.78%, 91.03% and 86.93%, respectively. Compared with existing models, the prediction method performs well and shows a notable improvement.
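The jackknife test mentioned above holds out each sample in turn, trains on the remaining samples, and checks whether the held-out sample is predicted correctly. The following is a minimal sketch of that procedure, not the thesis's implementation: a 1-nearest-neighbor rule is swapped in for the SVM so that the example stays self-contained, and the names `nearest_neighbor_predict` and `jackknife_accuracy` are hypothetical.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Stand-in classifier (assumption: the thesis uses an SVM instead)."""
    def squared_distance(i):
        return sum((a - b) ** 2 for a, b in zip(train_x[i], query))
    best = min(range(len(train_x)), key=squared_distance)
    return train_y[best]


def jackknife_accuracy(features, labels, predict=nearest_neighbor_predict):
    """Leave-one-out accuracy: each sample is tested on a model built without it."""
    if len(features) < 2:
        raise ValueError("jackknife needs at least two samples")
    correct = 0
    for i in range(len(features)):
        # Remove sample i from the training data, then predict it.
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)
```

Because every held-out sample is excluded from training, the jackknife estimate is usually lower than the self-consistency accuracy, where the model is tested on the same data it was trained on. This is consistent with the ordering of the accuracies reported above.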
Keywords/Search Tags: membrane protein, feature extraction, amino acid index, correlation coefficient, dimensions optimization, dataset construction standards