Research On Protein Classification Based On Mixed Features

Posted on: 2021-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: X Q Ru
GTID: 2370330629950530
Subject: Computer Science and Technology
Abstract/Summary:
Proteins play an important role in the activities of organisms, and many of them are irreplaceable by virtue of their unique structures and functions. Protein recognition is the first step in exploring the biological functions of proteins and lays a foundation for subsequent research. Since the launch of the Human Genome Project, the number of protein sequences of unknown structure and function has grown rapidly. Traditional biological experiments for classifying and identifying protein sequences can no longer keep pace, so many researchers have applied machine learning algorithms to protein classification. Several problems remain in protein classification and recognition: class imbalance in the data sets; protein sequence information that is poorly expressed in numerical form; invalid or mutually redundant features in the feature set; and the choice of an inappropriate classification algorithm. This study summarizes these problems in existing models and explores the classification of phage proteins and electron transport proteins.

For the classification and recognition of phage proteins, this thesis extracts protein information from multiple angles and combines the resulting feature sets so that the different feature types complement one another. A feature selection algorithm then selects features that correlate strongly with the labels and have low redundancy with other features, and ranks the selected features. Using the random forest algorithm, the optimal feature subset of each feature type is obtained by computing the performance metrics as each ranked feature is added to the feature set in turn. Comparison experiments verify that the model based on both basic sequence information and structural information performs better than a model based on a single feature type. The superiority of the proposed model is further verified with respect to feature extraction methods and classification algorithms.

For the classification and recognition of electron transport proteins, the proposed model both performs well and computes quickly. The models are built using only the first four dimensions of the features extracted by the DT algorithm. In this part of the study, the imbalanced data sets are first processed with an algorithm similar in idea to EasyEnsemble. Evolutionary information and frequency distribution information are then taken into account during feature extraction. Based on four feature extraction algorithms, a total of 40 sub-models are constructed and examined in turn to find the classification model best suited to this study. In addition, by observing the numerical distributions of the valid features for positive and negative examples, the study concludes that the larger the difference between the positive and negative distributions, the better the classification performance of the model.
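The abstract says only that the imbalanced electron transport protein data were handled with an algorithm similar in idea to EasyEnsemble, and gives no further detail. As a rough illustration of that general idea, not the thesis's actual procedure, the sketch below draws several random majority-class subsets, each the same size as the minority class, and pairs each one with the full minority class to form a balanced training set for one sub-model. The function names `balanced_subsets` and `majority_vote`, the fixed seed, and the use of a simple vote to combine sub-model outputs are all assumptions made for this example.

```python
import random


def balanced_subsets(minority, majority, n_subsets, seed=0):
    """Build n_subsets balanced training sets (hypothetical helper).

    Each set contains every minority sample plus an equally sized random
    draw, without replacement, from the majority class.
    Assumes len(minority) <= len(majority).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    subsets = []
    for _ in range(n_subsets):
        sampled_majority = rng.sample(majority, len(minority))
        subsets.append(list(minority) + sampled_majority)
    return subsets


def majority_vote(predictions):
    """Combine 0/1 predictions from the sub-models (assumed combination rule).

    Returns 1 only when strictly more than half of the sub-models predict 1;
    ties resolve to 0.
    """
    return 1 if sum(predictions) * 2 > len(predictions) else 0


if __name__ == "__main__":
    # Toy data: 5 positive and 50 negative samples, labelled by a tag.
    positives = [("pos", i) for i in range(5)]
    negatives = [("neg", i) for i in range(50)]
    for subset in balanced_subsets(positives, negatives, n_subsets=3):
        n_pos = sum(1 for tag, _ in subset if tag == "pos")
        print(len(subset), n_pos)
```

Each balanced subset would then train one classifier, so that every majority-class sample has a chance of being used while no single sub-model sees a skewed class ratio.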
Keywords/Search Tags: phage protein, electron transport protein, feature extraction, classification and recognition