| Proteins are the main carriers and material basis of life activities; they participate in a wide range of biological processes and encode a wealth of information in their sequences. With the progress of the Human Genome Project and advances in high-throughput sequencing technologies, a large amount of protein sequence data has been acquired. However, the gap between the number of proteins with known sequences and the number with known functions continues to widen, and this gap makes it difficult to meet the needs of biological research. Protein function prediction has therefore become an important and extremely challenging research topic. It can help not only to explore the origin of life and genetic variation, but also to understand the pathogenesis of major diseases and to provide an important theoretical basis for their diagnosis and prevention. Traditional experimental methods for determining protein function are expensive and time-consuming, and cannot be carried out on a large scale. Computational methods can make up for some of these shortcomings and have become a research hotspot of bioinformatics in the post-genome era. This work centers on protein sequence data: it extracts latent pattern information from protein sequences and then uses machine learning to identify protein function. It mainly examines commonly used methods for extracting feature vectors from biological sequences, feature selection algorithms, machine learning algorithms, and the indicators currently used to evaluate classifier performance. Based on an extensive literature review, we summarize the general process of using computational methods to predict protein function from sequence, and propose effective prediction schemes for two different protein function prediction problems. The main research results are as follows: (1) For the prediction of Type Ⅲ
secreted effectors (T3SEs), a prediction model based on the position-specific scoring matrix (PSSM) is proposed. We first extract three feature vectors from the PSSM, and then use the eXtreme Gradient Boosting (XGBoost) algorithm for feature selection to remove redundant information. Finally, we use the Support Vector Machine (SVM) algorithm to train the prediction model and evaluate its performance. Results on the independent dataset show that our method achieves higher prediction accuracy with fewer features than most previously proposed methods, and can serve as a powerful tool for identifying T3SEs. (2) For the identification of antioxidant proteins (AOPs), a prediction model based on ensemble learning is proposed that integrates multiple protein-encoding strategies. First, three kinds of feature vectors are extracted from the PSSM, and an SVM is trained on each kind to obtain three base classifiers. The prediction results of the base classifiers are then combined into a new feature vector, on which an SVM is trained again to obtain the final prediction model. The evaluation results show that some single-feature models perform better than the ensemble model on both the training dataset and the independent dataset. We therefore speculate that an ensemble model does not always outperform a model based on a single feature; the key is to choose the right features. In addition, our method outperforms existing prediction methods on most performance indicators and is expected to become an effective prediction model for AOPs. Taken together, the experimental results show that the two proposed methods perform well: compared with some methods in the same field, performance is greatly improved, which can provide a new research idea for protein function prediction in bioinformatics. |
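
The two-stage ensemble described for AOPs (one base classifier per encoding, then a second-level classifier trained on their outputs) can be illustrated with a minimal sketch. It only shows the data flow and is not the authors' implementation: a simple nearest-centroid scorer stands in for the SVM so that the example needs no third-party libraries, the toy feature values are invented, and every function name is hypothetical.

```python
from math import exp

def train_centroid(X, y):
    """Stand-in for an SVM (assumption): store the mean vector of each class."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    pos = mean([x for x, label in zip(X, y) if label == 1])
    neg = mean([x for x, label in zip(X, y) if label == 0])
    return pos, neg

def score(model, x):
    """Return a value in (0, 1); above 0.5 means x is nearer the positive centroid."""
    pos, neg = model
    d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
    z = max(-50.0, min(50.0, d_pos - d_neg))  # clamp to keep exp() in range
    return 1.0 / (1.0 + exp(z))

def fit_stack(views, y):
    """views: one feature matrix per encoding, rows aligned with the labels y."""
    # Stage 1: one base model per feature encoding.
    base = [train_centroid(X, y) for X in views]
    # Stage 2: the base-model scores of each sample form its meta-feature vector.
    meta_X = [[score(m, X[i]) for m, X in zip(base, views)] for i in range(len(y))]
    meta = train_centroid(meta_X, y)
    return base, meta

def predict_stack(stack, sample_views):
    """sample_views: the same sample encoded once per feature encoding."""
    base, meta = stack
    meta_x = [score(m, x) for m, x in zip(base, sample_views)]
    return 1 if score(meta, meta_x) >= 0.5 else 0

# Invented toy data: three encodings of four proteins (two positive, two negative).
views = [
    [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]],
    [[1.0], [0.9], [0.0], [0.2]],
    [[0.7, 0.1, 0.9], [0.6, 0.2, 0.8], [0.2, 0.9, 0.1], [0.1, 0.8, 0.2]],
]
y = [1, 1, 0, 0]

stack = fit_stack(views, y)
print(predict_stack(stack, [[0.85, 0.85], [0.95], [0.65, 0.15, 0.85]]))
print(predict_stack(stack, [[0.15, 0.15], [0.1], [0.15, 0.85, 0.15]]))
```

In this sketch the meta-features are computed on the same samples used to fit the base models. Stacking pipelines commonly use out-of-fold predictions instead to limit leakage; the abstract does not state which variant was used.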