
Inference and analysis of the human splicing code

Posted on: 2017-08-27
Degree: Ph.D
Type: Thesis
University: University of Toronto (Canada)
Candidate: Xiong, Hui Yuan
Full Text: PDF
GTID: 2460390011969644
Subject: Molecular biology
Abstract/Summary:
We construct and analyse a computational model that predicts the outcome of alternative splicing by recognizing features in RNA sequences. The computational model can be viewed as a "splicing simulator" for a range of healthy human tissues. It takes as input a pre-mRNA sequence surrounding a possibly alternatively spliced exon and estimates the inclusion level of that exon in mature RNA, after splicing occurs. The model is trained using a supervised machine learning framework where the training examples are the alternatively spliced exons, the feature vectors are derived from RNA sequences near these exons, and the targets are their corresponding splicing outcomes in healthy individuals. The model is inferred from over 15 million DNA elements derived from the human reference genome and encoded as 1,300 numerical RNA features, 10,689 alternative exons mined from RefSeq and EST databases, and RNA-Seq data from 16 healthy human tissues. A Bayesian ensemble of neural networks capable of accounting for combinatorial effects of RNA features is used to learn the relationship between the RNA features and the splicing outcomes. By identifying combinations of functionally important DNA elements, the model accounts for 65% of the variance in the inclusion level of out-of-sample test exons.

By learning genome-wide patterns that relate RNA sequences to splicing on the reference genome, we found that the model is capable of generalizing to new genetic contexts and predicting splicing outcomes for novel sequences. We applied the model to analyze the effects of more than 650,000 intronic and exonic variants on splicing. We observed that disease-associated mutations disrupt splicing much more often than common mutations, revealing previously unknown potential disease mechanisms. Surprisingly, these splicing-disrupting mutations are not limited to mutations at splice sites. Many deep intronic mutations are also predicted to disrupt splicing. In focused studies on mutations related to spinal muscular atrophy and Lynch syndrome, we found that our computational predictions agree well with the effects of splicing-disrupting mutations previously identified in independent biological experiments. In a focused study on autism spectrum disorder, we found that mutations with large effects on splicing are significantly more concentrated in brain-related genes in autism patients than in control subjects.

This thesis is a step towards using artificial intelligence and large amounts of genomic data to automatically model the complex cellular mechanisms that read and process DNA. In our opinion, computational models constructed using this approach will bring significant value to genomic medicine, because they can model biological mechanisms and can be used for a wide range of sequences. As a result, the cellular effects of mutations can be predicted even if the mutation has not been observed before. This ability can be used for genetic diagnostics, studying the effects of complex diseases, and searching for treatments. In addition, it is anticipated that these computational models will improve with the growing size of genomic data available for training.
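
The workflow described in the abstract (an ensemble of predictors that maps RNA features to an exon inclusion level, scores a mutation by comparing predictions for the reference and mutated sequence, and is evaluated by variance explained) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not the thesis's implementation: the toy logistic members stand in for the neural networks, the plain average stands in for Bayesian model averaging, the two-element feature vectors stand in for the 1,300 RNA features, and variance explained is taken to be 1 - SS_res / SS_tot. All function names are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence

# A predictor maps a feature vector to an inclusion level in [0, 1].
Predictor = Callable[[Sequence[float]], float]


def make_toy_predictor(weights: Sequence[float], bias: float) -> Predictor:
    """Hypothetical ensemble member: a logistic function of a weighted sum."""
    def predict(features: Sequence[float]) -> float:
        z = bias + sum(w * x for w, x in zip(weights, features))
        return 1.0 / (1.0 + math.exp(-z))
    return predict


def ensemble_inclusion(members: List[Predictor], features: Sequence[float]) -> float:
    """Average the members' predictions (a simple stand-in for model averaging)."""
    return sum(m(features) for m in members) / len(members)


def variant_effect(members: List[Predictor],
                   ref_features: Sequence[float],
                   alt_features: Sequence[float]) -> float:
    """Predicted change in inclusion level: mutated minus reference sequence."""
    return ensemble_inclusion(members, alt_features) - ensemble_inclusion(members, ref_features)


def variance_explained(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Fraction of target variance explained, assumed here to be 1 - SS_res / SS_tot."""
    mean_t = sum(targets) / len(targets)
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    return 1.0 - ss_res / ss_tot


# Build a small ensemble whose members differ slightly in their weights.
rng = random.Random(0)
members = [
    make_toy_predictor([rng.gauss(1.0, 0.1), rng.gauss(-2.0, 0.1)], bias=0.0)
    for _ in range(5)
]

ref = [0.8, 0.1]   # made-up features of the reference sequence
alt = [0.8, 0.9]   # made-up features after a mutation changes the second feature
delta = variant_effect(members, ref, alt)
```

In this toy setup the second feature has a negative weight in every member, so increasing it lowers the predicted inclusion level and `delta` comes out negative. A variant whose features match the reference gets an effect of exactly zero.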
Keywords/Search Tags:Splicing, Model, RNA sequences, Computational, Human, Mutations, Data