
Research On Recognition Methods And Associated Features Of Intron Retention

Posted on: 2016-04-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R Mao
GTID: 1220330482455133
Subject: Agricultural Electrification and Automation
Abstract/Summary:
Alternative splicing allows many distinct mature mRNA transcripts to be created from a single gene by using different splice sites. It is a major source of gene expression regulation and proteome diversity. Next-generation sequencing technology is opening new opportunities in the life sciences. As massive RNA-seq datasets and their metadata (e.g., environmental treatments, developmental stages and sampled tissues) become available, more novel alternative splicing events are expected to come to light in plants, where previous studies argued that alternative splicing occurred less frequently. Moreover, in mammals most alternatively spliced genes have exons that are entirely spliced out, and intron retention is the least prevalent form of alternative splicing; in plants, by contrast, more introns are retained in mature mRNAs. Current studies of retained introns (RIs) pay little attention to the clear recognition of RIs or to the systematic identification of their features. As a model plant, Arabidopsis has rich genome annotation and transcriptome RNA-seq data. Based on these data, this thesis presents a detailed analysis of intron retention, the most common type of alternative splicing in plants. The central work and results of this thesis are as follows:

1) Research on two algorithms for identifying RIs and constitutively spliced introns (CSIs). The first algorithm is based on the TAIR10 gene annotation and sequence files; it identifies RIs and CSIs by comparing the coordinates of introns in the genome sequence. The second algorithm is based on RNA-seq data, which provide expression information for transcripts.
RIs and CSIs are discerned through a series of steps: data preprocessing with CLC, mapping short RNA-seq reads with GSNAP, transcriptome reconstruction with Cufflinks, merging assemblies with Cuffmerge, and calculating the FPKM of each transcript and testing differential expression and regulation between all pairs of samples with Cuffdiff2. This algorithm also refactors the first algorithm, and then calculates and records the FPKM of all RIs and CSIs from the FPKM of the transcripts. Compared with the existing algorithm, the algorithm we designed reduces redundancy among RIs, corrects misclassification of CSIs and produces an up-to-date database of RIs and CSIs. In this database, 4856 RIs (1384 already present in TAIR10 and 3472 newly identified) and 58436 CSIs are identified and recorded, based on RNA-seq data from developmental stages, sampled tissues and abiotic treatments. From our biotic stress RNA-seq data, 2262 RIs are mined; of these, 675 are recorded in the TAIR10 annotation file and 1587 are newly identified.

2) Research on a new hybrid feature extraction approach for the systematic classification of RIs and CSIs. The approach combines three kinds of features: local and global nucleotide sequence features of introns, frequent motifs and biological features. Using random forest and PSOSVM to differentiate these two types of introns in Arabidopsis, we demonstrated that the proposed approach classifies RIs and CSIs more accurately than four other feature extraction approaches.

3) Research on key factors that affect the classification of RIs and CSIs. The expression level of a transcript is a key feature of RIs. However, when only this feature was added to our feature vector set, the classification performance of random forest and PSOSVM was not as good as expected. We therefore further analyzed the relative expression level of RIs. A gene with retained introns always has two types of transcripts.
One is the transcript with RIs; the other is the transcript without RIs. We introduced Rirate to describe the relative strength of FPKM between the transcript with RIs and the transcript without RIs. A new positive set was then built using the new criterion of a positive Rirate, and the random forest and PSOSVM classifiers were used for prediction. The accuracy increased from 0.741 and 0.653 to 0.928 and 0.892, respectively, and the best AUC value reached 0.985. This improvement in classification performance showed that Rirate is a key factor in distinguishing RIs from CSIs.

4) Research on RIs under abiotic and biotic stress in Arabidopsis. First, we used high-throughput RNA-seq to explore the transcriptomes under treatment with TMV and the antiviral agent polysaccharide Krestin (PSK), and obtained information on RIs. Second, significantly differentially expressed genes related to RIs were identified. Finally, GO analysis of these genes showed that they are important for metabolic processes, abiotic and biotic stress response processes, protein kinase activity and adenyl nucleotide binding. These results imply that intron retention is a regulatory mechanism by which Arabidopsis responds to abiotic and biotic stress.

5) Research on typical features of RIs in comparison with CSIs in Arabidopsis. Follow-up analysis of the different intron sets we obtained revealed several distinguishing features. The conserved motif (“YTRAY”) near the branch site was more difficult to find in RIs than in CSIs. On average, RIs have higher GC content and weaker 5’ and 3’ splice site signals than CSIs. RIs also show higher similarity to their flanking exons than CSIs do. We propose ag/ga-rich motifs such as “gaag”, “gaga”, “agag” and “agga” as intron splicing suppressors. Correspondingly, tttt-containing motifs appear to be intron splicing enhancers.
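The annotation-based identification in item 1) can be illustrated with a minimal sketch. It assumes, as one common working definition and not necessarily the exact rule used in the thesis, that an intron is retained when it lies entirely inside an exon of another transcript of the same gene, and constitutive when it is spliced out in every annotated transcript. The function names and the toy coordinates are hypothetical; real input would come from the TAIR10 annotation.

```python
def introns_of(exons):
    """Return introns as (start, end) gaps between sorted exons.

    Coordinates are assumed to be 1-based and inclusive.
    """
    exons = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
        if next_start - 1 >= prev_end + 1
    ]


def classify_introns(transcripts):
    """Split the introns of one gene into retained and constitutive sets.

    transcripts: dict mapping a transcript id to its list of (start, end) exons.
    Assumed definitions (illustrative only):
      retained     = intron fully contained in an exon of some transcript
      constitutive = intron spliced out in every transcript of the gene
    """
    intron_sets = {tid: set(introns_of(ex)) for tid, ex in transcripts.items()}
    all_introns = set().union(*intron_sets.values())

    retained = {
        (start, end)
        for (start, end) in all_introns
        if any(
            exon_start <= start and end <= exon_end
            for exons in transcripts.values()
            for exon_start, exon_end in exons
        )
    }
    constitutive = {
        intron
        for intron in all_introns
        if all(intron in introns for introns in intron_sets.values())
    }
    return retained, constitutive


# Toy gene with two isoforms; the second keeps the intron at 301-400.
gene = {
    "T1": [(1, 100), (201, 300), (401, 500)],
    "T2": [(1, 100), (201, 500)],
}
retained, constitutive = classify_introns(gene)
print("retained:", sorted(retained))
print("constitutive:", sorted(constitutive))
```

Introns that fall into neither set (for example, those affected by alternative 5’ or 3’ splice sites) are left unclassified in this sketch.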
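The sequence features discussed in item 5) can be computed along the following lines. This is an illustrative sketch, not the feature extraction code of the thesis: the function names, the 50-nt window used to look for the branch-site motif and the example sequence are assumptions. “YTRAY” is expanded with the IUPAC codes Y = C/T and R = A/G.

```python
import re

# IUPAC expansion of the branch-site motif YTRAY (Y = C or T, R = A or G).
BRANCH_MOTIF = re.compile(r"[CT]T[AG]A[CT]")
# ag/ga-rich motifs proposed in the abstract as intron splicing suppressors.
SUPPRESSOR_MOTIFS = ("GAAG", "GAGA", "AGAG", "AGGA")


def gc_content(seq):
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    count = 0
    pos = seq.find(motif)
    while pos != -1:
        count += 1
        pos = seq.find(motif, pos + 1)
    return count


def intron_features(seq, branch_window=50):
    """Compute a few simple per-intron features.

    branch_window is an assumed size (in nt) of the 3' region that is
    searched for the YTRAY motif; the abstract does not specify one.
    """
    seq = seq.upper()
    three_prime_region = seq[-branch_window:]
    features = {
        "gc_content": gc_content(seq),
        "has_branch_motif": BRANCH_MOTIF.search(three_prime_region) is not None,
        "tttt_count": count_overlapping(seq, "TTTT"),
    }
    for motif in SUPPRESSOR_MOTIFS:
        features[motif.lower() + "_count"] = count_overlapping(seq, motif)
    return features


# Hypothetical short intron sequence, for illustration only.
example = "GTAAGAGAGTTTTTCTAACTTTGCAG"
for name, value in intron_features(example).items():
    print(name, value)
```

Feature dictionaries of this kind could then be assembled into a matrix and passed to a classifier such as a random forest, which is the role the hybrid feature set plays in items 2) and 3).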
Keywords/Search Tags: Alternative Splicing, Retained Introns, Constitutively Spliced Introns, Random Forest, PSOSVM, New Hybrid Feature Extraction Approach