
Research On Protein Coding Regions Prediction Algorithms Based On Filtering Theories And Statistics Of Characteristics Of DNA Sequences

Posted on: 2014-03-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y T Ma
GTID: 1220330422968050
Subject: Signal and Information Processing
Abstract/Summary:
The deoxyribonucleic acid (DNA) structure of eukaryotes is much more complicated than that of prokaryotes. Over the past decades, the accuracy of protein coding region (exon) prediction has remained far from satisfactory for annotating a newly sequenced DNA sequence or genome. In this thesis, protein coding region prediction algorithms built around digital filters, tests of statistical significance, and Fisher discriminant analysis (FDA), respectively, are explored and studied to improve the accuracy. The main contributions are as follows:

First, starting from a study of the sliding discrete Fourier transform (SDFT), a multi-sliding-window periodogram (MSWP) based protein coding region prediction algorithm is proposed. The longer the SDFT window, the better the triplet base periodicity (TBP) signal is extracted, while the shorter the SDFT window, the more precisely the TBP signal is located in base position. The MSWP algorithm combines the merits of long and short SDFT windows and strikes a good tradeoff between frequency-domain and time-domain precision.

Second, using a finite impulse response (FIR) or infinite impulse response (IIR) filter with linear phase within the pass band, a narrow pass-band filter (NPBF) based exon prediction algorithm is presented. The algorithm provides much better accuracy than the other independent algorithms. This is because the side effects of the group delay of the FIR and IIR filters are suppressed, and the orders of the FIR filters and the parameters of the IIR filters are chosen carefully according to prediction experiments. A moving average filter is also used, which greatly smooths the grass-like power spectral density curve. The proposed NPBF algorithm is suitable for both FIR and IIR filters. All-phase theory is applied to gene prediction for the first time through the design of an all-phase NPBF.
A two-fold threshold method is also introduced to improve the sensitivity of the prediction algorithm to coding regions with lower power spectral density.

Third, the relationship between the mapping method and the prediction accuracy is studied using the NPBF algorithm. Many different mapping methods have been proposed and great progress has been made over the past decades, but the relationship between mapping method and prediction accuracy has rarely been studied on a large enough DNA sequence data set. The study on the HMR195 and ALLSEQ data sets provides a fair verification showing that the Voss and Z curve methods are the best mapping methods. The results provide a good reference for future studies.

Fourth, a recently proposed prediction algorithm based on the t-test and z-test is explored. The results show that the algorithm is good at visualizing the difference between coding and non-coding regions, and provides very high prediction accuracy for DNA sequences with long coding regions and short non-coding regions. Although its accuracy for DNA sequences with short coding regions and long non-coding regions is far from acceptable, improving the algorithm to make it acceptable is a valuable research project.

Finally, a suggestion is made for choosing the threshold of the DNA sequence classification algorithm based on FDA and the Z curve. At least five different thresholds are available for the FDA in separating coding sequences from non-coding sequences. The most suitable threshold is put forward according to the results of seven-fold cross-validation clustering experiments.

The contributions listed above are helpful in improving the accuracy, and provide some valuable research results and references for solving real gene prediction problems.
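The window-length tradeoff described for the SDFT can be illustrated with a minimal Python sketch. It is not the thesis implementation: it assumes the Voss mapping (four binary indicator sequences) and computes the windowed DFT power at frequency 1/3 around each base. The function names, the window lengths (99 and 351), and the way the two curves are combined (averaging max-normalized curves) are all hypothetical choices made for illustration; the abstract does not state how MSWP merges its windows.

```python
import numpy as np


def voss_map(seq):
    """Voss mapping: one 0/1 indicator sequence per nucleotide."""
    seq = seq.upper()
    return {b: np.array([1.0 if c == b else 0.0 for c in seq]) for b in "ACGT"}


def period3_power(seq, window=351):
    """Windowed DFT power at frequency 1/3, centred on each base position.

    `window` must be odd so that each window has a well-defined centre.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    # Complex exponential at normalized frequency 1/3 (the period-3 component).
    kernel = np.exp(-2j * np.pi * np.arange(window) / 3.0)
    power = np.zeros(len(seq))
    for x in voss_map(seq).values():
        padded = np.pad(x, half)  # zero-pad so every base gets a full window
        # Convolving with the reversed kernel gives sum_j padded[i+j]*kernel[j].
        coeff = np.convolve(padded, kernel[::-1], mode="valid")
        power += np.abs(coeff) ** 2
    return power


def multi_window_power(seq, windows=(99, 351)):
    """ASSUMED combination rule: average of max-normalized curves.

    This is only a stand-in for the MSWP idea of using long and short windows.
    """
    curves = [period3_power(seq, w) for w in windows]
    return np.mean([c / max(c.max(), 1e-12) for c in curves], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    flank = "".join(rng.choice(list("ACGT"), size=300))
    toy = flank + "ATG" * 100 + flank  # synthetic period-3 stretch in the middle
    score = multi_window_power(toy)
    print("mean score inside stretch:", score[400:500].mean())
    print("mean score in left flank: ", score[50:250].mean())
```

On the synthetic sequence in the demo, the score is expected to be clearly higher inside the period-3 stretch than in the random flanks. A longer window gives a stronger peak but blurs the stretch boundaries, while a shorter window does the opposite, which is the tradeoff the multi-window approach aims to balance.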
Keywords/Search Tags: Protein coding regions prediction, Multi-sliding-window periodogram, Linear phase narrow pass-band filter, All-Phase filter, Mapping method, TZT prediction algorithm, Fisher discriminant analysis