| With the development of society, human health and public health face great threats, and researchers are committed to finding new treatments for diseases. Therapeutic peptides have become a research hotspot in bioinformatics because of their low side effects and multi-selectivity. At present, methods based on biological sequences and machine learning are important for studying and predicting the structure and function of peptide sequences, and many researchers have proposed machine-learning-based models for therapeutic peptide prediction. This paper studies several specific problems of therapeutic peptide prediction from the perspectives of feature extraction, classification algorithms, prediction model construction, and data set construction. To predict peptide function quickly and accurately, this paper uses machine learning to establish an effective prediction model based on peptide sequence information. To address the curse of dimensionality caused by different feature spaces and the impact of redundancy in the feature space on prediction performance, this paper proposes a multi-source fusion feature learning prediction model. To further advance peptide prediction, the problem is transformed into a graph classification problem, and a peptide prediction method based on a graph convolutional neural network is proposed. In addition, this paper explores the detectability of polypeptide sequences. By verifying and analyzing peptide detectability, peptides can be detected in proteomics and their functions can then be analyzed. The contents of this research include anti-cancer peptide feature extraction, the ACPred-Fuse model, anti-cancer peptide prediction based on a graph convolutional neural network, and the prediction of peptide detectability. This paper mainly covers the following aspects. Firstly, this paper introduces the framework
of machine-learning-based peptide prediction models and the classification algorithms commonly used in this framework, including the support vector machine, random forest, artificial neural network, and graph convolutional neural network. These classification algorithms are the basis of this paper's research on prediction models. By comparing the application scenarios, advantages, and disadvantages of the different classification algorithms, this paper proposes several models built on several of them. Finally, the performance evaluation methods and metrics for prediction models are introduced, which provide an effective basis for verifying the models proposed in this paper. Secondly, for the prediction of therapeutic peptides, this paper first surveys the current state of therapeutic peptide databases, from which researchers can obtain the sequences and physicochemical properties of related peptides. To establish a prediction model effectively, features must be extracted in advance. This paper extracts features of the sample data from multiple perspectives, such as sequence, structure, and relevance, involving 14 feature extraction algorithms in total. The information provided by the peptide databases, combined with multi-perspective peptide feature extraction, provides an important data basis for the subsequent prediction and study of peptides. Thirdly, machine learning methods are widely used in the field of protein prediction. To further improve prediction performance, this paper proposes a multi-source feature fusion learning method. Through multi-perspective feature extraction and feature fusion optimization, the prediction model ACPred-Fuse is established. The model is developed by fusing 29 different sequence-based feature prediction methods. The multi-view features are then optimized to form an optimal feature combination. Finally, the feature combination is
used to train the optimal prediction model. As for the machine learning algorithm, after experimental comparison of different classification algorithms, random forest is selected as the model's classifier. The construction of the model is divided into three steps: feature extraction, feature representation learning and optimization, and fused feature learning and feature representation optimization. Compared with existing prediction models, ACPred-Fuse achieves better prediction performance. Fourthly, the ACPred-Fuse model requires peptide features to be extracted first and the model then to be trained on them. Feature extraction and feature pool construction are relatively complex and are not suitable for large-scale data. To overcome this complexity, this paper proposes a prediction model based on a graph convolutional neural network. Peptide prediction is treated as a graph classification problem, and each peptide sample is treated as a graph. First, the peptide data set and its amino acid sequence data are extracted, and one-hot encoding is used to represent the features of each peptide. Second, the distance between samples is calculated, and an adjacency matrix is established to construct the peptide and amino acid graph. Then, the graph convolutional neural network is trained on the data set, and the classification results and evaluation metrics are obtained after minimizing the loss. The cross-entropy loss function is used to optimize the classification results. The experimental results show that the peptide prediction model based on the graph convolutional neural network has better prediction performance. Fifthly, peptide detectability is an important research topic in the peptide prediction field. To overcome the shortcomings of existing methods, this paper proposes an end-to-end Transformer-based method built on a Siamese network to predict and improve peptide detectability. This
method needs only the peptide sequence and does not require the physicochemical properties of the peptide or other existing experimental results. By using the Transformer and gated recurrent unit architectures, it automatically learns context-sensitive embedded representations from the peptide, fully capturing global and local information to represent peptide detectability. A new loss function is introduced into the model, which effectively improves its generalization ability. Experimental comparison shows that the proposed PepFormer performs better in prediction accuracy and generalization ability. More importantly, this method can automatically learn and explore the non-discriminatory information in the sequence without any prior knowledge or manual feature engineering. |
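
The graph construction steps summarized in the abstract (one-hot encoding of residues, followed by an adjacency matrix) can be illustrated with a minimal sketch. The abstract does not specify the distance measure or the adjacency rule, so the sequence-window rule used here, the residue ordering, and all function names are assumptions made purely for illustration and should not be read as the paper's actual method.

```python
# Illustrative sketch only. The alphabet order, the window-based adjacency
# rule, and every function name below are assumptions, not the paper's method.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot_encode(peptide):
    """Return one 20-dimensional one-hot vector per residue (node features)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    features = []
    for residue in peptide:
        row = [0] * len(AMINO_ACIDS)
        row[index[residue]] = 1  # raises KeyError for non-standard residues
        features.append(row)
    return features


def build_adjacency(length, window=1):
    """Connect residues whose sequence positions differ by at most `window`.

    Self-loops are included (the diagonal is 1), as is common for GCN inputs.
    """
    return [
        [1 if abs(i - j) <= window else 0 for j in range(length)]
        for i in range(length)
    ]


def peptide_to_graph(peptide, window=1):
    """Turn a peptide string into (node feature matrix, adjacency matrix)."""
    return one_hot_encode(peptide), build_adjacency(len(peptide), window)


features, adjacency = peptide_to_graph("ACDK")
```

Each peptide then yields a node feature matrix and an adjacency matrix, which is the input form a graph convolutional network typically consumes for graph classification.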