
Prediction Of Protein SUMO Modification Sites Based On Cost-sensitive Learning

Posted on: 2021-05-03
Degree: Master
Type: Thesis
Country: China
Candidate: S S Ye
Full Text: PDF
GTID: 2370330620968136
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of science and technology, human beings have made great progress in many fields. The implementation of the Human Genome Project and the maturity of next-generation gene sequencing technologies have generated massive amounts of biological data. How to use the latest technology to mine the biological information behind these data is of great significance for the development of biology.

Post-translational modification (PTM) is a regulatory process that alters the original chemical composition of a protein. It regulates protein function and coordinates a great number of cellular processes by adding modifying groups, such as phosphate, glycosyl, ubiquitin, and fatty acyl groups, to one or several amino acid residues. Modification by the small ubiquitin-related modifier (SUMO), known as SUMOylation, is an important and distinctive type of PTM that regulates a substrate's function mainly by altering its intracellular localization or by affecting other types of post-translational modification. The SUMO protein is a member of the ubiquitin-like protein family and can affect protein stability, enzyme activity, and protein interactions. Identifying SUMOylation sites in prokaryotes or eukaryotes is important for better understanding various diseases, such as cancers and Alzheimer's disease.

Data imbalance, in which the number of positive samples is much smaller than the number of negative samples, is very common in bioinformatics classification tasks, and it also arises in the prediction of SUMO modification sites. Machine learning algorithms are easily affected by data imbalance. Although various computational methods for predicting SUMOylation sites have been developed, they do not perform well on imbalanced datasets, and their true positive rate is low.

This thesis first analyzes the feature extraction methods for protein sequences proposed in previous work, proposes a new feature based on the biochemical characteristics of amino acids at special positions, compares different features, and selects the best feature combination as input to the machine learning model. Experimental results show that the new feature helps to improve the accuracy of the model. To address the data imbalance problem, this thesis uses the AdaCost algorithm and a genetic algorithm to improve the performance of a cascade-forest-based predictor of SUMO modification sites. Experimental results show that these measures can greatly alleviate the data imbalance problem. Compared with existing methods, the proposed method not only greatly improves the true positive rate but also performs well in accuracy (Acc), specificity (Sp), Matthews correlation coefficient (MCC), and area under the curve (AUC).

This thesis also explores the application of deep learning, especially convolutional neural networks, to the prediction of SUMO modification sites. Different loss functions are used in the experiments. The experimental results are analyzed and the reasons for the poor performance of convolutional neural networks are discussed, which points out directions for future research.
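The remark above about imbalanced data and a low true positive rate can be illustrated with a small sketch. The code below is not taken from the thesis: the function names (`confusion_counts`, `imbalance_metrics`) and the toy labels are assumptions made for illustration only. It computes sensitivity (true positive rate), Sp, Acc, and MCC from binary labels; AUC is left out because it needs predicted scores rather than hard labels.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = modified site, 0 = not)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def imbalance_metrics(y_true, y_pred):
    """Return sensitivity (Sn), specificity (Sp), accuracy (Acc), and MCC."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    acc = (tp + tn) / len(y_true)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Hypothetical toy data: 10 positive sites and 90 negative sites.
y_true = [1] * 10 + [0] * 90

# A degenerate predictor that labels every site as negative.
always_negative = [0] * 100

# A predictor that finds 8 of the 10 positives at the cost of 9 false alarms.
finds_positives = [1] * 8 + [0] * 2 + [1] * 9 + [0] * 81

trivial = imbalance_metrics(y_true, always_negative)
useful = imbalance_metrics(y_true, finds_positives)
print("always negative:", trivial)
print("finds positives:", useful)
```

On this toy data the all-negative predictor reaches an Acc of 0.90 with a sensitivity of 0, while the second predictor has a slightly lower Acc of 0.89 but a sensitivity of 0.8 and a clearly positive MCC. This is one reason to report the true positive rate and MCC alongside accuracy when the classes are imbalanced.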
Keywords/Search Tags:SUMO protein, Cost-Sensitive Learning, Cascade forest, AdaCost, Genetic Algorithm, Convolutional neural network