Applied Research Of Data Mining Technology In Prediction Of Protein Subcellular Location

Posted on: 2009-11-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zou
Full Text: PDF
GTID: 2120360242976645
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Bioinformatics research shows that proteins can take part in the life activities of a cell only when they are transported to the correct location; an error in this process can seriously affect the function of the cell or even the life of the organism. In addition, proteins are not static: they usually carry out their functions within a particular subcellular environment. Knowing a protein's subcellular location therefore helps in inferring its function, and, conversely, research on protein function deepens our understanding of subcellular structure. Subcellular location information has become so important that traditional laboratory techniques, which are often costly and time-consuming, are no longer sufficient. In recent years bioinformatics has produced a great deal of research and many meaningful results. The construction of databases and the analysis and prediction of subcellular location have accelerated research on protein structure and function, and subcellular location is one of the key functional features of a protein. As the volume of subcellular location data grows, data analysis carries increasing weight. Our main concern is to find the bioinformatics rules that govern subcellular location and to determine protein function. Developing a computer-aided method for predicting protein subcellular location is therefore the key problem, and it is the purpose of this thesis.

Analyzing the sequence features related to subcellular location provides useful information for prediction. Based on this principle, a computer-aided method for predicting subcellular location using a fusion algorithm is devised. First, all the relevant human sperm proteins are obtained from the SWISS-PROT database and divided into a training subset and a test subset. Second, feature vectors are extracted from these two data sets. Third, the subcellular location is predicted using the proposed method.
Finally, the prediction results are evaluated.

Two key problems have to be solved in this procedure. The first is how to extract effective features. The second is how to predict subcellular location effectively, especially for proteins with multiple location labels. This thesis studies both problems in depth.

For the first problem, the thesis seeks effective features by analyzing in detail the protein composition information, the physicochemical properties of proteins, Gene Ontology, and motifs.

The second problem, prediction of subcellular location, is the core of this thesis. Because protein function is complex, predicting subcellular location has always been difficult; however, machine learning methods can be used to improve prediction accuracy. Three contributions have been made in this area. First, human cell data sets containing multi-label location information are created for machine learning approaches. Second, a fusion algorithm for predicting subcellular location based on an improved Dempster-Shafer method is proposed. By fusing multiple features, it achieves higher prediction accuracy than methods that use a single feature. Third, the thesis investigates the phenomenon of multi-label location, which has usually been left out of past research. Because the method integrates multiple features, it can mine multi-location information and therefore yield accurate prediction results.

New contributions of this thesis are as follows:
(1) Established a strict human sperm subcellular data set containing multi-label location information from SWISS-PROT.
(2) Adopted a newly developed GO discrete model to represent protein sequences; results show that it is an effective feature and that prediction accuracy is greatly improved.
(3) Proposed the improved Dempster-Shafer fusion algorithm to predict subcellular location.
By fusing the overall and partial PseAA, the GO representation, and the motifs, more accurate prediction results can be obtained.
(4) Studied the multi-label location phenomenon: the algorithm proposed in this thesis can integrate all useful features and mine all the location information, and can therefore predict the multiple locations to which a protein actually belongs.
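
The fusion step described above builds on Dempster-Shafer evidence theory. As a minimal sketch, the Python code below implements the classical Dempster rule of combination, not the improved variant proposed in the thesis, whose details the abstract does not give. The function name `combine`, the location names, and the mass values are all hypothetical and chosen only for illustration; in practice the masses might come from classifiers trained on individual feature types such as PseAA or GO.

```python
from itertools import product


def combine(m1, m2):
    """Classical Dempster rule of combination for two mass functions.

    Each mass function maps a frozenset of candidate locations to a mass.
    The masses within each function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            # Agreeing evidence supports the intersection of the two hypotheses.
            combined[common] = combined.get(common, 0.0) + wa * wb
        else:
            # Disjoint hypotheses contradict each other; track the lost mass.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("the two sources are in total conflict")
    # Renormalize so that the remaining masses sum to 1.
    return {h: w / (1.0 - conflict) for h, w in combined.items()}


# Hypothetical frame of discernment and made-up masses, for illustration only.
ALL = frozenset({"nucleus", "cytoplasm", "membrane"})
m_pseaa = {frozenset({"nucleus"}): 0.6, frozenset({"cytoplasm"}): 0.3, ALL: 0.1}
m_go = {frozenset({"nucleus"}): 0.5, frozenset({"membrane"}): 0.2, ALL: 0.3}

fused = combine(m_pseaa, m_go)
for hypothesis, mass in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(sorted(hypothesis), round(mass, 3))
```

Mass left on the full set `ALL` expresses how uncertain a source is. Because `combine` returns a mass function of the same form, a third source (for example, one based on motifs) could be fused by calling `combine(fused, m_motif)` with a further hypothetical mass function.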
Keywords/Search Tags: subcellular location, location prediction, Dempster-Shafer evidence theory, support vector machine, bioinformatics, human sperm, Gene Ontology