Development Of Ensemble Learning-based Prediction Method And Comprehensive Database For Deleterious Synonymous Mutation In The Human Genome

Posted on: 2022-03-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: N Cheng
GTID: 1484306542967359
Subject: Computer Science and Technology
Abstract/Summary:
Synonymous mutations do not change the encoded amino acids and were long considered nonfunctional. Advances in next-generation sequencing technologies have detected numerous synonymous mutations in the human genome, and in recent years much work has pointed to their important role in many human diseases. Experimental characterization of all identified synonymous mutations is impractical because it is time-consuming, costly and labour-intensive. To meet this need, several bioinformatics platforms and tools have been proposed to support the prioritization of synonymous mutations. Although these prediction methods have greatly advanced the field, the prediction of pathogenic synonymous mutations still has limitations. First, the predictions of these tools are inconsistent, which makes it difficult to choose among them. Second, despite the computational methods proposed in recent years, the precise prediction of pathogenic synonymous mutations remains challenging. Third, the fragmentation and heterogeneity of the available data and algorithms make it difficult to readily obtain comprehensive information on pathogenic synonymous mutations. To address these shortcomings, we carried out the following systematic studies.

(1) An ensemble framework for improving the prediction of deleterious synonymous mutations by feature representation learning. We explored multimodal features across four groups (functional score, conservation, splicing and sequence features) and trained eight conceptually different machine learning classifiers on each group, resulting in 32 base classification models. We then selected four base models according to their prediction performance and used their predicted probabilities as the input feature vector of a logistic regression classifier, yielding an ensemble method named EnDSM. EnDSM achieved better performance than other state-of-the-art predictors on the training and independent test datasets. The EnDSM server and the benchmarking datasets are freely available at http://bioinfo.ahu.edu.cn/EnDSM.

(2) Comparison and integration of computational methods for deleterious synonymous mutation prediction. This study systematically compared 10 computational models (including methods specific to pathogenic synonymous mutations and general methods for single nucleotide mutations) in terms of the algorithms used, the features calculated, performance evaluation and software usability. In addition, we constructed two carefully curated independent test datasets and used them to assess the robustness and scalability of these methods for identifying deleterious synonymous mutations. To improve predictive performance, we established an integrated tool named Prediction of Deleterious Synonymous Mutation (PrDSM), which averages the scores generated by the three most accurate predictors. Our benchmark tests showed that the ensemble model PrDSM outperformed the reviewed tools for the prediction of deleterious synonymous mutations. The PrDSM website is available at http://bioinfo.ahu.edu.cn:8080/PrDSM/. We hope that this survey and the proposed strategy for building more accurate models can serve as a useful guide for future development of computational methods for deleterious synonymous mutation prediction.

(3) Construction of a more comprehensive database of human disease-causing synonymous mutations. The architecture of dbDSM v2.0 is composed of two layers. In the first layer, we manually curated deleterious synonymous mutations (DSMs) from ~18,000 abstracts and scrutinized the full text of more than 1,000 publications. Compared with dbDSM v1.0, dbDSM v2.0 provides more detailed information about pathogenic synonymous mutations, including transcripts and variant descriptions in Human Genome Variation Society nomenclature. In the second layer, we added new annotation fields across six categories: functional score, conservation, splicing, translation efficiency, transcription factor binding site and sequence features. Furthermore, we used a voting method to integrate the scores of the six feature categories in order to evaluate the impact of all possible synonymous mutations in the human genome, and we incorporated the high-confidence putatively deleterious synonymous mutations into dbDSM v2.0 for convenient access through the web application. Finally, the scoring system was applied to 28 cancer types derived from TCGA, and potential prognostic biomarkers were discovered in several cancers. Detailed information on dbDSM v2.0 can be obtained from http://bioinfo.ahu.edu.cn:8080/dbDSM/index.jsp.
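
The abstract describes three integration strategies: feeding base-model probabilities into a logistic regression classifier (the ensemble in study 1), averaging the scores of the most accurate predictors (study 2), and voting across feature categories (study 3). The sketch below illustrates each idea in plain Python. It is not the dissertation's implementation: the function names, the toy data, the gradient-descent fit, the 0.5 threshold and the simple-majority rule are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_meta_logistic(base_probs, labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression meta-classifier by batch gradient descent.

    base_probs: one row per mutation, one column per selected base model
    (hypothetical layout). labels: 1 = deleterious, 0 = benign.
    """
    n = len(base_probs)
    n_features = len(base_probs[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(base_probs, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_meta(weights, bias, x):
    """Ensemble probability for one mutation from its base-model probabilities."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def average_score(scores):
    """Average the scores of several predictors (score-averaging integration)."""
    return sum(scores) / len(scores)


def majority_vote(category_scores, threshold=0.5):
    """Return True when more than half of the category scores reach the threshold.

    Both the threshold and the simple-majority rule are assumptions.
    """
    votes = sum(1 for s in category_scores if s >= threshold)
    return votes > len(category_scores) / 2


# Toy probabilities from four base models (invented values, not real data).
train_x = [
    [0.90, 0.80, 0.70, 0.95],
    [0.85, 0.60, 0.90, 0.80],
    [0.20, 0.10, 0.30, 0.15],
    [0.10, 0.30, 0.20, 0.05],
]
train_y = [1, 1, 0, 0]
weights, bias = fit_meta_logistic(train_x, train_y)
```

With the fitted weights, `predict_meta(weights, bias, [0.9, 0.9, 0.8, 0.9])` gives the stacked probability for a new mutation, `average_score` combines three predictor scores into one, and `majority_vote` takes six category scores and reports whether most of them flag the mutation.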
Keywords/Search Tags:Deleterious Synonymous Mutation, Ensemble Learning, Prediction Model, Pathogenicity Prediction, Database Construction