Prediction Method Of Protein Glycation Site Based On Ensemble Deep Learning

Posted on: 2022-08-20   Degree: Master   Type: Thesis
Country: China   Candidate: Y Z Xu   Full Text: PDF
GTID: 2480306566967689   Subject: Bioinformatics
Abstract/Summary:
With the rapid development of genome technology, proteomics has gradually become a focus of research. Proteins are the basic macromolecules directly involved in physiological and biochemical activities, and they usually need to undergo post-translational modification (PTM) before they become mature proteins and can perform their functions normally. Glycation is a very important type of PTM. Protein glycation is a complex non-enzymatic process composed of a series of biochemical reactions, and its final products are called advanced glycation end products (AGEs). AGEs can directly denature proteins or form ligands with proteins, which can mediate the inflammatory response of cells. Many studies have shown that AGEs are associated with diseases such as nephropathy, Alzheimer's disease and atherosclerosis. Conventional experimental verification by mass spectrometry costs a great deal of manpower and time, so a computer-aided method for predicting protein glycation sites is needed.

In recent years, many machine learning methods have been applied to predict protein glycation sites. These methods can predict glycation sites accurately, but several problems remain. First, with respect to the learner, most existing prediction methods use a support vector machine as the base learner. Because it separates classes with a hyperplane, a support vector machine can predict glycation sites effectively on a small dataset, but as the amount of glycation site data grows, it falls short in both accuracy and training efficiency. Second, with respect to features, the existing methods use only biological sequence features based on prior knowledge, so the features come from a single source. Their feature learning process also depends heavily on feature selection, which leads to poor generalization of the existing models. Finally, at the data level, most existing methods build a balanced dataset by under-sampling, that is, by randomly selecting a subset of the negative samples. This naive under-sampling prevents the model from learning a comprehensive pattern of the negative samples, which leaves it unable to distinguish glycation sites from non-glycation sites. In short, the existing methods still have many defects.

To address these problems, this thesis proposes Gly-EDL, a protein glycation site prediction method based on ensemble deep learning. First, for feature representation, Gly-EDL draws on multiple sources, including protein language model features and biological schema features, to capture more comprehensive latent information about protein glycation sites. Second, for feature learning, a different learning strategy is adopted for each type of feature, and an attention mechanism is used to integrate the different feature types, which together form the ensemble deep learning architecture. Owing to these feature learning strategies, the Gly-EDL architecture is flexible and efficient and performs well in model transferability. Finally, to deal with imbalanced sample categories, Gly-EDL applies a weight-based resampling (WBR) method and a data enhancement method to the existing datasets during the training phase. These methods alleviate, to a certain extent, both the shortage of glycation site data and the class imbalance, and thereby improve the robustness and transferability of the model. With these strategies, Gly-EDL achieves superior performance in protein glycation site prediction compared with other existing methods and has a clear advantage in model generalization.

The Gly-EDL method proposed in this thesis lets researchers use a computer-aided method more easily to make a preliminary judgment on whether glycation occurs, so that it serves as a preliminary screening step and reduces experimental workload. In addition, Gly-EDL aims to offer a new solution for prediction and classification problems related to biological sequences, so that more methods and strategies from other fields can be applied to biological sequence research. In this way, Gly-EDL is intended to facilitate the work of more researchers.
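The contrast between naive under-sampling and weight-based resampling can be illustrated with a minimal sketch. The abstract does not specify how WBR assigns its weights, so the inverse-class-frequency weighting below is an assumption chosen only for illustration, and the function names `random_undersample` and `weighted_resample` are hypothetical, not taken from Gly-EDL. The labels are synthetic (1 = glycation site, 0 = non-glycation site).

```python
import random
from collections import Counter

def random_undersample(labels, seed=0):
    """Baseline: keep every positive and an equally sized random subset of negatives.

    Assumes negatives are the majority class; the discarded negatives are
    never seen by the model.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return pos + rng.sample(neg, len(pos))

def weighted_resample(labels, n_draws, seed=0):
    """Draw sample indices with replacement, weighting each sample by the
    inverse frequency of its class (an assumed weighting, not the thesis's WBR)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    weights = [1.0 / counts[y] for y in labels]
    return rng.choices(range(len(labels)), weights=weights, k=n_draws)

# Synthetic imbalanced dataset: 10 positives, 90 negatives.
labels = [1] * 10 + [0] * 90

# Naive under-sampling fixes one subset of 10 negatives for all of training.
kept = random_undersample(labels)

# Weighted resampling draws a fresh, roughly balanced batch every epoch,
# so different negatives are visited over the course of training.
neg_seen = set()
pos_fraction = []
for epoch in range(20):
    batch = weighted_resample(labels, n_draws=100, seed=epoch)
    neg_seen.update(i for i in batch if labels[i] == 0)
    pos_fraction.append(sum(labels[i] for i in batch) / len(batch))

print("negatives kept by under-sampling:", len(kept) - sum(labels[i] for i in kept))
print("distinct negatives seen with weighted resampling:", len(neg_seen))
print("mean positive fraction per batch:", sum(pos_fraction) / len(pos_fraction))
```

Under this assumed weighting, each batch is balanced in expectation while the negatives it contains change from epoch to epoch, which is one way to avoid the loss of negative-sample information that the abstract attributes to random under-sampling.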
Keywords/Search Tags:Post-translational modification, Protein glycation site prediction, Ensemble deep learning, Resampling, Data enhancement, Attention mechanism, Multi-feature fusion