The Research On Gene Splice Site Prediction With Conditional Random Fields

Posted on: 2013-01-18
Degree: Master
Type: Thesis
Country: China
Candidate: J Cao
GTID: 2210330374462869
Subject: Biological Information Science and Technology
Abstract/Summary:
Research on genome function has attracted growing attention in recent years, focusing mostly on genome identification, function extraction, and use of the resulting information. Because certain monomers, or "characters", occur exclusively at specific positions in a gene sequence, their numerical characteristics can be used in scientific experiments, and the sequence model can therefore be studied with probability theory. Probabilistic prediction is a promising way to annotate unknown genomes, because it is likely to reduce experimental cost and improve efficiency. Since accurately identifying gene splice sites helps to annotate the genome and discover new genes, this thesis is devoted mainly to the gene splice site problem. The central issue is how to combine the functional and structural differences in DNA sequences with probabilistic prediction methods to identify exon/intron splice sites. The main work of the thesis is as follows:
(1) Introduce the concept and structure of gene splice sites, the various prediction methods and tools, the use of open data sources, and the criteria for evaluating methods.
(2) Survey gene splice site prediction methods based on probabilistic graphical models, and discuss the characteristics and principles of two probabilistic graphical models.
(3) Investigate directed probabilistic graphical models for splice site prediction, focusing on the Bayesian Network Model and the Hidden Markov Model (HMM). Experimental analysis shows that the HMM balances sensitivity and specificity better than the Bayesian Network Model. However, it has difficulty predicting easily confused genes, because its output independence assumption neglects context.
(4) Considering the limitations of these graphical models, study an undirected graphical model, the Conditional Random Fields (CRFs) model. Visualizing sequence identification yields the accumulated information at specific loci and highlights the most conserved GT-AG pattern of the splice site, which is the basis for determining the model parameters. Combining an improved parameter-estimation optimization algorithm, IIS, with global feature extraction shortens the training time of the model.
(5) Evaluate the performance of the various prediction models using accuracy and the receiver operating characteristic (ROC) curve. The experimental results indicate that CRFs perform better than the other models at predicting splice sites, making them a promising method for gene splice site prediction.
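The conserved GT-AG pattern and the sensitivity/specificity criteria mentioned above can be illustrated with a minimal Python sketch. It is not the thesis's method: the function names `candidate_splice_sites` and `evaluate` are hypothetical, the example sequence is made up, and specificity is assumed here to mean TN / (TN + FP), which may differ from the definition used in the thesis.

```python
def candidate_splice_sites(seq):
    """List positions of GT (candidate donor) and AG (candidate acceptor) dinucleotides.

    Hypothetical helper: it only enumerates candidates; a model such as an
    HMM or CRF would then decide which candidates are true splice sites.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


def evaluate(predicted, actual, candidates):
    """Compute sensitivity, specificity and accuracy over a set of candidate sites.

    Assumed definitions: sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP), accuracy = (TP + TN) / all candidates.
    """
    predicted, actual, candidates = set(predicted), set(actual), set(candidates)
    tp = len(predicted & actual)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    tn = len(candidates - predicted - actual)

    def ratio(num, den):
        # Avoid division by zero when a class is empty.
        return num / den if den else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


if __name__ == "__main__":
    donors, acceptors = candidate_splice_sites("AAGTCCAGGTAAAG")
    print("candidate donors:", donors)
    print("candidate acceptors:", acceptors)
    # Made-up labels: pretend only the first GT is a real donor site,
    # while a naive predictor accepts every GT.
    print(evaluate(predicted=donors, actual=donors[:1], candidates=donors))
```

Accepting every GT as a donor site gives perfect sensitivity but poor specificity, which is why the abstract compares models on the balance between the two.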
Keywords/Search Tags: conditional random fields, splice site, pretreatment, visualization of sequence identification