
Database Construction And Computational Prediction Of Cancer Driver Indels

Posted on: 2020-03-27  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Z Y Yue
GTID: 1360330575465157  Subject: Bioinformatics
Abstract/Summary:
Next-generation sequencing identifies a large number of variants per human genome, of which only a few may underlie genetic diseases. Mutations that significantly affect the structure or function of disease-related genes can result in a variety of diseases, such as neurodegenerative disorders, immunodeficiency diseases and various types of cancer. Today, cancer is one of the leading causes of death in humans. Some key cancer-associated genes have been identified as causative. Further mutation analysis of these genes, which play an important role in inducing disease, is necessary to develop effective diagnostic and treatment methods. Identifying mutations that contribute to cancer development is a critical step in understanding tumor biology and developing targeted therapies. Mutations that provide a selective growth advantage, and thus promote cancer development, are termed driver mutations, and those that do not are termed passengers. At present, multiple advanced cancer mutation databases curate data on different types of mutations, including indels (insertions/deletions). Identifying molecular cancer drivers by computational prediction is critical for precision oncology. However, some problems still limit the development of this field:

1. In terms of database development, while recent advances in next-generation sequencing technologies have enabled the creation of a multitude of databases in cancer genomic research, there is not yet a comprehensive database focusing on the annotation of driver indels. In addition, while recently emerged driver mutation data sets are available for developing computational methods that predict cancer mutation effects, benchmark sets focusing on passenger mutations are largely missing.

2. In terms of prediction methods, despite recent advances in methodology for identifying drivers in cancer genomes, there is not yet a prediction tool focusing on cancer frameshift indels. In addition, existing pathogenic frameshift indel predictors may suffer from many missing values because of different choices of transcripts during variant annotation.

Owing to these drawbacks, driver indel patterns and prediction algorithms remain under-investigated. In this work, we constructed a database of cancer driver indels and studied prediction methods. The details are as follows:

1. We constructed the database of Cancer driver InDels (dbCID), a collection of known coding indels that are likely to be involved in cancer development, progression or therapy. dbCID currently contains experimentally supported and putative driver indels derived from manual curation of the literature. Using the data stored in dbCID, we characterized features of driver indels at four levels (gene, DNA, transcript and protein) by comparing them with putative neutral indels from VariSNP, the 1000 Genomes Project (1000GP) and the NHLBI Exome Sequencing Project (ESP6500), respectively. We found that most of the genes containing driver indels in dbCID are known cancer genes that play a role in oncogenesis. Contrary to expectation, the sequences affected by driver frameshift indels are not longer than those affected by neutral ones. In addition, frameshift and inframe driver indels are more likely to disrupt highly conserved regions, both in DNA sequences and in protein domains. This database is freely available online at http://bioinfo.ahu.edu.cn:8080/dbCID.

2. We proposed a prediction model, CIDPredictor (Cancer driver InDels Predictor), that can accurately discriminate cancer driver frameshift indels from passenger frameshift indels. More specifically, we built a random forest classifier with k-mer counts based purely on DNA coding sequences. The results on the independent test set indicated that this method outperforms other widely used non-cancer-specific methods in distinguishing known cancer driver frameshift indels from passengers (area under the ROC curve of 0.939). Furthermore, because only sequence-based features are used, CIDPredictor can always return a result when an indel is input together with a coding sequence, thus effectively avoiding missing values. CIDPredictor is freely available online at http://bioinfo.ahu.edu.cn:8080/CIDPredictor.

3. We adopted three methods for handling imbalanced data to study the prediction of cancer driver frameshift indels and to address the imbalanced classification problem. With the semi-supervised learning method, an AUC (area under the ROC curve) of 0.942 and a sensitivity of 0.830 were obtained on the independent test set. With the method combining undersampling and ensemble learning, an AUC of 0.944 and a sensitivity of 0.832 were obtained. With the method combining oversampling and a synthetic data filtering technique, an AUC of 0.936 and a sensitivity of 0.813 were obtained. The specificities of the three methods were consistently 0.999, and the sensitivities were all significantly higher than that of the prediction model without imbalanced data processing (a sensitivity of 0.801). Among the three, the best results were obtained by combining undersampling and ensemble learning.

4. Using biological features, we proposed an XGBoost-based prediction model, PredCID (Predictor for Cancer driver InDels). Through data preprocessing steps such as removing sequence homology and selecting driver and passenger frameshift indels that lie close to each other in the genome, the ratio of positive to negative samples in the training set reached 1:1. In terms of features, we used eight biological characteristics from four different levels (gene, DNA, transcript and protein) to construct a classifier that can accurately distinguish cancer driver frameshift indels from passenger frameshift indels. On the independent test set, the area under the precision-recall curve (AUPR) and the area under the ROC curve (AUC) were 0.969 and 0.980, respectively, outperforming other widely used non-cancer-specific prediction tools. PredCID is freely available online at http://bioinfo.ahu.edu.cn:8080/PredCID.

5. We developed a comprehensive literature-based database of Cancer Passenger Mutations (dbCPM), currently containing 941 experimentally supported and 978 putative passenger mutations derived from manual curation of the literature. Using the missense mutation data, the largest group in dbCPM, we investigated patterns of missense passenger mutations by comparing them with missense driver mutations. The results indicated that missense passenger mutations differ significantly from drivers at multiple levels and exhibit pleiotropic functions depending on the tumor context. Using the missense passenger mutations in dbCPM, we assessed the performance of four cancer-focused mutation effect predictors. Although all the prediction tools displayed good true positive rates, their true negative rates were relatively low because of the lack of negative training samples with experimental evidence. Finally, we explored the passenger insertion/deletion mutation data in dbCPM and analyzed their characteristics. dbCPM is freely available online at http://bioinfo.ahu.edu.cn:8080/dbCPM.
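
The sequence-only features described for CIDPredictor (k-mer counts computed from a DNA coding sequence and an indel) can be illustrated with a minimal sketch. The abstract does not state the value of k, which sequence the k-mers are counted on, or how the indel is encoded, so everything below is an assumption: the function names `apply_indel`, `is_frameshift` and `kmer_counts`, the choice k = 3, the toy coding sequence, and counting k-mers on the mutated sequence are all hypothetical and not taken from the dissertation.

```python
from collections import Counter
from itertools import product


def apply_indel(cds, pos, ref, alt):
    """Replace `ref` with `alt` at 0-based position `pos` of a coding sequence.

    An insertion has an empty `ref`; a deletion has an empty `alt`.
    """
    if cds[pos:pos + len(ref)] != ref:
        raise ValueError("reference allele does not match the coding sequence")
    return cds[:pos] + alt + cds[pos + len(ref):]


def is_frameshift(ref, alt):
    """An indel shifts the reading frame when its net length change
    is not a multiple of three; otherwise it is inframe."""
    return (len(alt) - len(ref)) % 3 != 0


def kmer_counts(seq, k=3):
    """Return counts of every possible DNA k-mer in a fixed (sorted) order,
    so each sequence maps to a vector of length 4**k with no missing values."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts.get("".join(p), 0) for p in product("ACGT", repeat=k)]


# Toy example (assumed data): delete one base from a short coding sequence.
cds = "ATGGCCAAGTTTGGCTAA"
mutated = apply_indel(cds, pos=6, ref="A", alt="")
features = kmer_counts(mutated, k=3)
```

A fixed-length vector such as `features` could then be passed to any standard classifier. Because it depends only on the sequence, it is defined for every indel that comes with a coding sequence, which is the property the abstract credits for avoiding missing values.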
Keywords/Search Tags: cancer, driver mutation, insertion/deletion (indel), database, prediction algorithm