
Exon Splicing Enhancer Identification Using Random Forests

Posted on: 2011-08-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y Bai
GTID: 2120330338981047
Subject: Computer Science and Technology
Abstract/Summary:
With the completion of the Human Genome Project, functional genomics has become one of the most active research areas. Many researchers study the alternative splicing of pre-mRNA because it is essential to protein function. Exon splicing enhancers, which are bound by SR proteins, are known to promote alternative splicing by activating nearby splice sites. Identifying exon splicing enhancers in the large volume of available exon sequence data is therefore of great significance.
Many notable methods exist for exon splicing enhancer identification, but they are limited in three main respects. First, the training data, especially the negative training data, are not well defined or well extracted. Second, classification relies on either sequence data or SR protein information; no existing work considers both, although both are important to exon splicing enhancers. Finally, existing algorithms have high space and time complexity. These limitations motivate the design of a more efficient classifier for exon splicing enhancer identification.
In this thesis, we design a novel classifier based on both sequence data and SR protein information. Compared with previous work, we make four contributions. First, instead of relying on heuristic biological information, we use random sequences as the negative training data, which reduces noise in the negative set (true enhancers mislabeled as negatives) and thus increases prediction precision. Second, we are the first to combine sequence data and SR protein information in exon splicing enhancer identification. Third, we adopt decision trees and random forests and study their parameters to optimize classification; we also employ cross-validation to avoid over-fitting during classifier training. Finally, we run experiments on human DNA sequences and compare our algorithm with the most recent work.
The experimental results show that our classifier achieves a precision of 95.16%, compared with 90.74% for the best existing exon splicing enhancer recognition method. Moreover, our algorithm is efficient in both space and time.
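
The data-preparation and evaluation steps named in the abstract (random sequences as negatives, cross-validation, precision) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the k-mer count features, the function names, and the example hexamers are all hypothetical placeholders, and the random forest itself is omitted (the feature vectors could be passed to any random forest implementation).

```python
import random
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def random_negatives(n, length, positives, seed=0):
    """Draw n distinct random sequences that are not in the positive set."""
    rng = random.Random(seed)
    known = set(positives)
    # Conservative feasibility check so the loop below always terminates.
    if n > len(ALPHABET) ** length - len(known):
        raise ValueError("not enough distinct sequences to draw from")
    negatives = set()
    while len(negatives) < n:
        candidate = "".join(rng.choice(ALPHABET) for _ in range(length))
        if candidate not in known:
            negatives.add(candidate)
    return sorted(negatives)


def kmer_features(seq, k=2):
    """Count overlapping k-mers, reported in a fixed lexicographic order.

    Assumption: k-mer counts stand in for whatever sequence features
    the thesis actually uses.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(ALPHABET, repeat=k)]


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def precision(y_true, y_pred):
    """Precision = true positives / (true positives + false positives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if tp + fp else 0.0


# Placeholder hexamers for illustration only; not data from the thesis.
positives = ["GAAGAA", "TGAAGA", "AAGAAG"]
negatives = random_negatives(len(positives), 6, positives)

X = [kmer_features(s) for s in positives + negatives]
y = [1] * len(positives) + [0] * len(negatives)
folds = k_fold_indices(len(X), k=3)  # each fold serves once as the held-out set
```

Each fold would be held out once while a classifier is trained on the remaining folds, and `precision` would be computed on the held-out predictions. Excluding known positives from the random draws is one simple way to limit mislabeled negatives.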
Keywords/Search Tags: exon splicing enhancer, SR proteins, alternative splicing, decision tree, random forests