| Protein phosphorylation is a widespread post-translational modification in eukaryotes. It plays an important role in biological processes such as energy metabolism, signal transduction, neural activity, the cell cycle, and apoptosis. The high cost of traditional experimental methods such as mass spectrometry has driven the rapid development of computational methods for identifying phosphorylation sites. Among these, non-kinase-specific methods are the most widely needed: they require no kinase information and take only the residue sequence as input. However, such methods often apply data compression strategies (such as random sampling) to the training data, to reduce the complexity caused by the large data volume and to improve training efficiency. These strategies often lose part of the original distribution of the samples. In addition, it is unsound to use the unlabeled residue sites in phosphorylation site databases directly as negative samples for training. How to address these two problems and design an effective algorithm for predicting unknown phosphorylation sites is therefore a meaningful research question. To address them, this thesis proposes two non-kinase-specific phosphorylation site prediction methods. (1) Prediction of phosphorylation sites based on a kernel fuzzy C-means clustering support vector machine. Drawing on granular computing and kernel fuzzy C-means clustering, this algorithm partitions the samples into granules in the high-dimensional feature space to obtain balanced information granules that represent the entire sample space. A granular support vector machine prediction model, KFCC-GSVM, is then built on the balanced granular data. This model improves the rationality and reliability of data compression for phosphorylation sites. Therefore, when the traditional support vector
machine algorithm is used for classification, the distribution of the compressed data in the kernel space matches that of the data before compression. Experimental results show that the method outperforms both Musite, an SVM-based non-kinase-specific phosphorylation site prediction method, and the traditional GSVM method. In addition, tests on an independent data set confirm the generalization performance of KFCC-GSVM. (2) Prediction of phosphorylation sites based on learning from positive and unlabeled samples (LPU). K-medoids clustering and the Spy technique are used to preprocess the unlabeled site data and obtain an initial set of suspected negative samples. AdaSampling, an adaptive sampling procedure, is then applied to this set to obtain the final reliable negative samples. Using a similarity index in the high-dimensional kernel space, the integrated feature data are iteratively clustered and granulated to obtain information granules containing more support vectors. Finally, the LPU-GSVM classification model is built based on the purity and density of the information granules. The same experiments show that the algorithm is more effective at predicting non-kinase-specific phosphorylation sites. A case study also demonstrates the algorithm's ability to identify potential phosphorylation sites. |
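
The kernel fuzzy C-means step named in method (1) can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes a Gaussian kernel, cluster centres kept in the input space, and the commonly used kernel fuzzy C-means update rules, and it omits the granulation and SVM training stages entirely. The function names (`gaussian_kernel`, `memberships`, `kernel_fcm`) and all parameters (`m`, `sigma`, `init_centers`, and so on) are hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_kernel(X, V, sigma):
    """Pairwise K(x_i, v_k) = exp(-||x_i - v_k||^2 / (2 * sigma^2))."""
    sq_dist = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def memberships(K, m):
    """Fuzzy memberships from kernel values; each row sums to 1."""
    # For a Gaussian kernel the squared feature-space distance is 2 * (1 - K).
    # The constant factor 2 cancels in the normalisation below.
    dist = np.maximum(1.0 - K, 1e-12)  # guard against division by zero
    inv = dist ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)


def kernel_fcm(X, n_clusters, m=2.0, sigma=1.0, n_iter=100, tol=1e-6,
               init_centers=None, seed=0):
    """Assumed variant: Gaussian-kernel fuzzy C-means with input-space centres."""
    X = np.asarray(X, dtype=float)
    if init_centers is None:
        # Pick distinct samples as starting centres.
        rng = np.random.default_rng(seed)
        V = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    else:
        V = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        K = gaussian_kernel(X, V, sigma)
        # Weight of each sample for each centre: u_ik^m * K(x_i, v_k).
        W = memberships(K, m) ** m * K
        V_new = (W.T @ X) / W.sum(axis=0)[:, None]
        shift = np.abs(V_new - V).max()
        V = V_new
        if shift < tol:  # stop once the centres no longer move
            break
    # Return memberships consistent with the final centres.
    return memberships(gaussian_kernel(X, V, sigma), m), V
```

Calling `kernel_fcm(X, n_clusters)` on a feature matrix returns a membership matrix and the cluster centres. In a granular SVM setting, the samples assigned to each cluster would then form one candidate granule; how granules are balanced and selected is specific to the thesis and is not reproduced here. Because the Gaussian kernel down-weights distant samples, the result depends on `sigma` and on the starting centres, so both are exposed as parameters.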