
Analysis And Prediction Of Lysine Malonylation Sites Based On Exploiting Informative Features And Ensemble Learning Model

Posted on: 2020-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: R P Xie
GTID: 2370330599459763
Subject: Engineering
Abstract/Summary:
Post-translational modification refers to the chemical modification of one or more amino acids by bonding functional groups (alkyl, alkenyl, phenyl, etc.), which changes their chemical properties or spatial structure and in turn affects the regulatory role of proteins in cellular processes. Among the many post-translational modifications of proteins, lysine malonylation transfers a malonyl group from malonyl-CoA to a lysine residue. This modification regulates glucose and fatty acid metabolism in liver tissue and is associated with high-morbidity metabolic diseases such as type 2 diabetes and obesity. Accurate identification of lysine malonylation sites from biological sequences therefore helps in understanding the pathogenesis and treatment of related diseases. In this work, an ensemble learning framework for accurate prediction of lysine malonylation sites is proposed based on experimentally validated data. The main work is as follows:

(1) Collection and preprocessing of data sets for lysine malonylation. First, malonylation-modified protein sequences were collected from public databases, and sequence windows of 25 residues were extracted with lysine (K) at the center. A window was defined as a positive sample if its central lysine (K) was malonylated and as a negative sample otherwise, yielding a high-quality data set of lysine malonylation sites for machine learning modeling. In addition, this work explores the differences between positive and negative sample sequences by sequence alignment. A large number of regional overlaps were found between positive and negative samples. It is therefore necessary to explore comprehensive sequence features and find the potential differences between positive and negative samples in order to construct a high-precision prediction model.

(2) Feature extraction and feature selection for residue sequences of lysine malonylation. To extract key patterns and features from the residues of lysine malonylation sites, 11 different feature encoding methods were reviewed, analyzed and compared, generating original feature vectors with a total of 2275 dimensions. The information gain feature selection algorithm was used to rank the importance of the original features, and the optimal feature set for each species was determined by training a random forest model with 10 repetitions of 10-fold cross-validation.

(3) Construction of the ensemble learning model. Based on four common machine learning methods (random forest, support vector machine, K-nearest neighbor and logistic regression) and the recently proposed LightGBM algorithm (a gradient boosting decision tree method), several single machine learning models were trained on the data of three species (Escherichia coli, mouse and human) using the optimal feature sets. Integrating the single machine learning models was found to further improve the robustness and prediction accuracy of the model. Finally, compared with the existing state-of-the-art predictor (MaloPred) on the independent test set, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively).

(4) Development of an online prediction server. Based on this ensemble model, we developed a highly concurrent, load-balanced online prediction server (http://kmalsp.erc.monash.edu/) using the Gearman task distribution framework to provide preliminary screening of lysine malonylation sites for a wide range of research groups.

We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for future developments of computational methods for post-translational modification site prediction, expedite the discovery of new malonylation sites and other post-translational modification types, and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.
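The sample construction described in item (1) of the abstract can be illustrated with a minimal Python sketch. It is not the thesis code: the function name `lysine_windows`, the padding symbol `X` for windows that extend past a protein terminus, and the 1-based site positions are all assumptions made for illustration, since the abstract states only the window length (25) and the labeling rule.

```python
from typing import Iterable, List, Set, Tuple

WINDOW = 25          # window length stated in the abstract
FLANK = WINDOW // 2  # 12 residues on each side of the central lysine
PAD = "X"            # assumed padding symbol near the termini (not from the abstract)


def lysine_windows(sequence: str, malonylated: Iterable[int]) -> List[Tuple[int, str, int]]:
    """Return (position, window, label) for every lysine in `sequence`.

    `malonylated` holds 1-based positions of experimentally verified sites
    (hypothetical input format). The label is 1 when the central K is
    malonylated and 0 otherwise.
    """
    positives: Set[int] = set(malonylated)
    seq = sequence.upper()
    # Pad both ends so that every lysine gets a full-length window.
    padded = PAD * FLANK + seq + PAD * FLANK
    samples = []
    for i, residue in enumerate(seq):
        if residue != "K":
            continue
        # Residue i of `seq` sits at index i + FLANK of `padded`,
        # so this slice is centered on it.
        window = padded[i : i + WINDOW]
        label = 1 if (i + 1) in positives else 0
        samples.append((i + 1, window, label))
    return samples


if __name__ == "__main__":
    # Toy sequence and site list, invented for the example.
    for position, window, label in lysine_windows("MKTAYIAKQR", [2]):
        print(position, window, label)
```

Each returned window has the lysine at index 12, so position-specific encodings can address the same flanking offsets in every sample. How the thesis handled lysines close to a terminus is not stated in the abstract; padding is only one common choice.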
Keywords/Search Tags: bioinformatics, lysine malonylation, feature extraction methods, ensemble learning, online prediction server