| With the progress of post-genome projects and the rapid development of high-throughput sequencing techniques, biological data continues to grow explosively, and computational methods have penetrated all fields of biology. Take protein succinylation as an example: the task is to determine which lysine residues in an unknown protein sequence are succinylated. Traditionally this problem is addressed mainly by mass spectrometry, which requires a great deal of time as well as substantial human and financial resources. A variety of computational methods have therefore been developed in recent years. These methods can identify protein succinylation sites efficiently and can assist experimental biologists in their research. In this paper, based on protein sequences and taking into account the annotation characteristics of succinylation site data, methods for identifying succinylation sites are studied in depth. The main points are summarised below. 1. In view of how succinylation site data are annotated, this paper constructs a method based on Positive-Unlabeled Learning (PU Learning) to identify succinylation sites. In computational methods for predicting succinylation sites, annotated succinylation sites are usually regarded as positive samples, and the remaining lysine sites without any succinylation annotation are regarded as negative samples. In fact, some of these negative samples may be positive. This labeling scheme therefore produces false negative samples, which affects prediction accuracy. To solve this problem, the PU bagging method is used in this paper to establish a new succinylation site prediction method, PUL_Succ. The main steps of this method are: first, randomly select data from the unlabeled samples and combine them with all positive samples to train a classifier by bagging; then use the trained classifier to predict the out-of-bag samples and record their scores; repeat the
above steps to obtain a rough classification of each unlabeled sample. 2. Because protein sequences are sequential data in which amino acid letters are arranged in a specific order, this paper uses an LSTM network and a CNN to construct a hybrid model, Deep Succ, to identify succinylation sites. First, drawing on previous feature-encoding evaluation work, this paper selects five superior feature encodings to characterize succinylation samples: one-hot, BLOSUM62, ACF, AAindex, and CKSAAP. Second, four network models (LSTM-CNN, CNN-LSTM, LSTM, and CNN) are constructed, and each of the five selected feature encodings is fed into each of these four models for training and evaluation. Based on the performance of each model, the best-performing models among them are chosen to construct the hybrid model Deep Succ, which is composed of five sub-modules for integrating heterogeneous information. Ten-fold cross-validation and independent test set results show that Deep Succ has good robustness. |
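
The PU bagging steps listed in point 1 can be illustrated with a minimal sketch. This is not the PUL_Succ implementation: the base learner here is a toy nearest-centroid scorer chosen only to keep the example dependency-free, and the function names, the default bag size (equal to the number of positives), the number of rounds, and the two-dimensional toy data are all assumptions made for illustration.

```python
import random
from statistics import fmean


def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [fmean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_centroid_scorer(pos, neg):
    """Toy base learner (assumption): a positive score means the sample
    lies closer to the positive centroid than to the negative one."""
    c_pos, c_neg = centroid(pos), centroid(neg)
    return lambda x: sq_dist(x, c_neg) - sq_dist(x, c_pos)


def pu_bagging(positives, unlabeled, n_rounds=50, bag_size=None, seed=0):
    """Average out-of-bag score for every unlabeled sample.

    Each round draws a bootstrap bag from the unlabeled set, treats it as
    provisional negatives, trains on (all positives vs. bag), and scores
    only the unlabeled samples left out of the bag.
    """
    rng = random.Random(seed)
    bag_size = bag_size or len(positives)  # hypothetical default
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    for _ in range(n_rounds):
        bag = [rng.randrange(len(unlabeled)) for _ in range(bag_size)]
        in_bag = set(bag)
        scorer = train_centroid_scorer(positives, [unlabeled[i] for i in bag])
        for i, x in enumerate(unlabeled):
            if i not in in_bag:  # out-of-bag samples only
                totals[i] += scorer(x)
                counts[i] += 1
    # A sample that was never out-of-bag has no score.
    return [t / c if c else None for t, c in zip(totals, counts)]


# Toy data: the first two unlabeled points are "hidden" positives.
positives = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2)]
unlabeled = [(1.0, 1.1), (0.95, 0.9),
             (-1.0, -1.0), (-1.2, -0.8), (-0.9, -1.1),
             (-1.1, -1.0), (-0.8, -1.2), (-1.0, -0.9)]
scores = pu_bagging(positives, unlabeled)
print(scores)
```

Thresholding the averaged out-of-bag scores gives the rough classification of unlabeled samples mentioned above. In practice the centroid scorer would be replaced by a stronger classifier trained on real sequence features.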
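
Two of the five feature encodings named in point 2, one-hot and CKSAAP, can be computed without any lookup tables, so they are sketched below. The flank size, the padding character `X`, the maximum gap `k_max`, and all function names are assumptions for illustration; the abstract does not state the settings actually used. BLOSUM62, ACF, and AAindex depend on external tables and are not reproduced here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs
PAIR_INDEX = {p: i for i, p in enumerate(PAIRS)}


def lysine_windows(sequence, flank=3, pad="X"):
    """Yield (position, window) for every lysine, padding at the termini."""
    padded = pad * flank + sequence + pad * flank
    for i, aa in enumerate(sequence):
        if aa == "K":
            yield i, padded[i:i + 2 * flank + 1]


def one_hot(window):
    """One 20-dimensional indicator vector per residue.
    Padding or unknown residues map to an all-zero vector."""
    rows = []
    for aa in window:
        row = [0] * len(AMINO_ACIDS)
        if aa in AA_INDEX:
            row[AA_INDEX[aa]] = 1
        rows.append(row)
    return rows


def cksaap(window, k_max=2):
    """Frequencies of residue pairs separated by k = 0..k_max residues,
    concatenated into one vector of length 400 * (k_max + 1)."""
    features = []
    for k in range(k_max + 1):
        counts = [0] * len(PAIRS)
        n_pairs = len(window) - k - 1
        if n_pairs <= 0:  # window too short for this gap
            features.extend([0.0] * len(PAIRS))
            continue
        for i in range(n_pairs):
            pair = window[i] + window[i + k + 1]
            if pair in PAIR_INDEX:  # skip pairs that include padding
                counts[PAIR_INDEX[pair]] += 1
        features.extend(c / n_pairs for c in counts)
    return features


# Hypothetical example sequence with lysines at positions 1 and 7.
windows = list(lysine_windows("MKTAYIAKQR", flank=3))
encoded = [(pos, one_hot(w), cksaap(w)) for pos, w in windows]
```

The one-hot output keeps the residue order and so suits sequence models such as an LSTM or a CNN, whereas CKSAAP is a fixed-length composition vector. Feeding each encoding to its own sub-module is one way the heterogeneous information could be combined.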