Protein Sub-nuclear Localization Based On Feature Fusion And Dimension Reduction Algorithm

Posted on: 2017-03-15
Degree: Master
Type: Thesis
Country: China
Candidate: S H Liu
GTID: 2180330488466893
Subject: Computer application technology
Abstract/Summary:
With the completion of human genome sequencing, high-throughput sequencing technology has become increasingly widespread, and protein sequences are being generated in abundance. Determining protein function has therefore become a major research topic in bioinformatics. Proteins carry out their biological activities within cells, so a protein's sub-cellular and sub-nuclear location is closely related to its function, and research on protein sub-nuclear localization can provide useful clues for the prevention, diagnosis, and treatment of genetic diseases. However, determining sub-nuclear localization with traditional biological experiments consumes large amounts of time and money. With the rapid development of computer science, machine learning approaches to protein sub-nuclear localization have recently become an active research area in bioinformatics, offering faster prediction at lower cost than experimental methods. This thesis studies the protein sub-nuclear localization problem using machine learning methods.

First, we review the basic concepts, background, significance, and current research status of the protein sub-nuclear localization problem, and outline the main content of the research. We then discuss protein sequence representations and classification from different perspectives and summarize the problems with existing sequence representation methods. Finally, we present the contributions of this thesis.

The first contribution is a method for protein sub-nuclear localization based on a fused feature representation and supervised locality preserving projection.
Traditional protein sequence representations extract features from only a single aspect of the sequence information, and traditional approaches design the classification model without analyzing the data distribution, which leaves the sequence representation and the classifier isolated from each other. The method therefore first fuses representations that carry complementary information to obtain a representation rich in discriminant information. It then applies supervised locality preserving projection, which learns a low-dimensional manifold of the data, to the fused representation, yielding low-dimensional discriminant data in which different classes are separated and within-class structure is preserved. Based on this data distribution, a K-nearest neighbor classifier is chosen to predict the protein sub-nuclear location. Experiments on standard data sets show high prediction accuracy. The method makes full use of the complementary sequence information in traditional representations and takes into account the relationship between the data distribution of the representation and the classification model, which improves overall prediction accuracy. However, it ignores the differences among proteins located at different sub-nuclear locations, which motivates the second contribution of this thesis.

The second contribution is a method for protein sub-nuclear localization based on effective fusion representations and linear discriminant analysis (LDA). Different feature representations contain different sequence information and contribute to protein sub-nuclear localization to different degrees, and proteins localized at different sub-nuclear locations have different functions.
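As a rough illustration of the fusion and classification steps of the first method, the sketch below concatenates two sequence representations and classifies with a nearest-neighbor rule. The abstract does not name the representations that are fused; amino acid composition and dipeptide composition are assumed here purely for illustration. The toy sequences and labels are invented, all function names are hypothetical, and the supervised locality preserving projection step is omitted.

```python
from collections import Counter
from itertools import product
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def amino_acid_composition(seq):
    """Frequency of each of the 20 standard residues (20-D, assumed representation)."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]

def dipeptide_composition(seq):
    """Frequency of each ordered residue pair (400-D, assumed representation)."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n = max(len(seq) - 1, 1)
    return [pairs[d] / n for d in DIPEPTIDES]

def fuse(seq):
    """Serial fusion: concatenate the two representations into one 420-D vector."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)

def knn_predict(train, query, k=1):
    """Majority vote among the k training vectors nearest to query (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented toy data: (sequence, hypothetical sub-nuclear label).
toy = [
    ("KKRKRKKSRK", "nucleolus"),
    ("RRKKRSKKRR", "nucleolus"),
    ("LLAVLLGALV", "nuclear_envelope"),
    ("AVLLGGLVAL", "nuclear_envelope"),
]
train = [(fuse(s), y) for s, y in toy]
prediction = knn_predict(train, fuse("KRKKSRRKKK"), k=1)
print(prediction)
```

In a fuller pipeline, a dimension reduction step would sit between `fuse` and `knn_predict`, so that neighbors are searched in the projected low-dimensional space.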
The second method therefore applies a different fusion process to the data for each sub-nuclear location and constructs two kinds of high-dimensional fusion representations with rich sequence information by exploiting the differences among proteins at each location. A genetic algorithm is used to find the feature combination coefficients of the fusion representation for each sub-nuclear location. Because the fusion representations are high-dimensional and contain redundant information, LDA is applied to them, and the dimensionality at which the predictor achieves the highest prediction accuracy is selected. An effective predictor is developed on this basis. Numerous experiments on two standard data sets show that the proposed methods achieve high prediction accuracy and that the classifier performs well.
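The coefficient search can be sketched as follows. This is a minimal genetic algorithm under several assumptions that the abstract does not confirm: real-valued coefficients in [0, 1] (one per feature group), truncation selection with elitism, uniform crossover, Gaussian mutation, and leave-one-out 1-nearest-neighbor accuracy as the fitness. It searches a single global coefficient vector rather than one per sub-nuclear location, the toy data is invented, all names are hypothetical, and the LDA step is omitted.

```python
import math
import random

def weighted_fusion(groups, weights):
    """Scale each feature group by its coefficient, then concatenate."""
    return [w * x for g, w in zip(groups, weights) for x in g]

def loo_accuracy(samples, weights):
    """Leave-one-out 1-NN accuracy on the weighted fused vectors (assumed fitness)."""
    fused = [(weighted_fusion(g, weights), y) for g, y in samples]
    hits = 0
    for i, (vec, label) in enumerate(fused):
        others = fused[:i] + fused[i + 1:]
        nearest = min(others, key=lambda item: math.dist(item[0], vec))
        hits += nearest[1] == label
    return hits / len(fused)

def genetic_search(samples, n_groups, pop_size=12, generations=15, seed=0):
    """Return the best coefficient vector and the best fitness per generation."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_groups)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda w: loo_accuracy(samples, w), reverse=True)
        history.append(loo_accuracy(samples, pop[0]))
        parents = pop[:pop_size // 2]      # truncation selection
        children = [parents[0][:]]         # elitism: carry the best forward
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            # Gaussian mutation, clipped back into [0, 1]
            child = [min(1.0, max(0.0, w + rng.gauss(0, 0.1))) for w in child]
            children.append(child)
        pop = children
    best = max(pop, key=lambda w: loo_accuracy(samples, w))
    history.append(loo_accuracy(samples, best))
    return best, history

# Invented toy data: group 0 is informative, group 1 is large-scale noise.
data_rng = random.Random(1)
samples = []
for label, center in (("A", 0.0), ("B", 1.0)):
    for _ in range(8):
        informative = [center + data_rng.gauss(0, 0.1)]
        noise = [data_rng.gauss(0, 5.0)]
        samples.append(([informative, noise], label))

best_weights, history = genetic_search(samples, n_groups=2)
print(best_weights, history[-1])
```

Because the best individual is copied unchanged into each new generation, the recorded best fitness never decreases. A per-location variant would run the same search once per sub-nuclear class with a class-specific fitness.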
Keywords: protein sub-nuclear localization, fusion representations, dimension reduction, supervised locality preserving projection, linear discriminant analysis