Similarity Analysis Of Biological Sequences And Gene Identification Based On Signal Processing Techniques

Posted on: 2012-11-19 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: S Y Wang | Full Text: PDF
GTID: 1480303389966379 | Subject: Circuits and Systems
Abstract/Summary:
Bioinformatics is a young interdisciplinary field. With the aid of computers and the internet, it studies biological macromolecules such as nucleic acids and proteins using theories and methods from mathematics and information science. Research in bioinformatics can help us address fundamental questions about biological evolution and the nature of life, and the knowledge embedded in living systems can in turn accelerate the development of other disciplines.

This dissertation explores applications of signal processing techniques in bioinformatics, focusing on similarity analysis of biological sequences and on gene identification. The main results are summarized as follows.

(1) Because the structural information in an RNA secondary structure is carried mainly by its base pairs, we construct base sequences from RNA secondary structure sequences according to their base pairs. Using the principles of orthogonal projection and the wavelet transform, we then design a base-pair transform on these base sequences. From this transform we construct a similarity function for comparing RNA secondary structures. The function combines the difference between the transformed results of two sequences with the difference between the associated locations, so it compares sequences comprehensively and is suitable for similarity analysis of RNA secondary structures. The proposed method has lower time complexity. In addition, the results it produces are more widely separated, which facilitates subsequent cluster analysis.

(2) Based on the Hamming distance from information theory, a universal bilateral similarity function is proposed for similarity analysis of biological sequences, including DNA, RNA secondary structures and proteins. The method requires no numerical mapping of the biological sequence, has lower time complexity, retains much of the information in the sequences, and unifies the similarity analysis of the three kinds of biological sequence. Simulation results demonstrate the validity and universality of the bilateral similarity function. For RNA secondary structures in particular, the results obtained with structure information taken into account are consistent with those obtained without it, which simplifies the procedure of similarity analysis.

(3) Based on the principle of symbolic dynamics, a novel representation method for DNA sequences is proposed. The representation is visual and has better numerical characteristics, which helps to reveal the chaotic characteristics of DNA sequences. Its visual nature supports graphical alignment and codon alignment of DNA sequences. From the codon alignment results, a similarity percentage between sequences is constructed for similarity analysis of DNA. Similarity analysis can also be carried out using a characteristic vector composed of the geometrical centers. The results show that the principle of symbolic dynamics can be applied effectively to DNA sequence analysis.

(4) Taking into account the differences between RNA secondary structure sequences and DNA sequences, the symbolic-dynamics representation for DNA is modified for RNA secondary structures. The starting point is that the stability of an RNA secondary structure is determined mainly by the free energy of its base pairs. The influence of the truncated length on the similarity analysis results is discussed in detail. In the time domain, combined with matrix invariants, the modified method supports quantitative similarity analysis of RNA secondary structures. In the frequency domain, a qualitative analysis further validates the modified method. Simulation results show that the principle of symbolic dynamics can also be applied effectively to similarity analysis of RNA secondary structures.

(5) Combining the symbolic-dynamics and Z-curve representations of DNA, the period-3 feature of protein-coding regions is used to design a gene identification model based on the extended Kalman filter. Thanks to the predictive ability of the extended Kalman filter, the model can effectively locate gene exons. To reduce background noise, a window operation is applied to the model output, which further improves the identification of coding and noncoding regions.
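
To make the period-3 feature of protein-coding regions concrete, the sketch below maps a DNA string to the three Z-curve step signals and measures their combined spectral power at frequency 1/3 in a sliding window. It is not the dissertation's model: the extended Kalman filter and the symbolic-dynamics representation are not reproduced here, and a plain discrete Fourier sum is swapped in as the simplest period-3 detector. The names `Z_STEPS`, `period3_power` and `sliding_period3`, and the window and step sizes, are illustrative assumptions.

```python
import cmath
import math

# Z-curve step (dx, dy, dz) contributed by each base:
#   dx: purine (A, G) vs pyrimidine (C, T)
#   dy: amino (A, C) vs keto (G, T)
#   dz: weak (A, T) vs strong (C, G) hydrogen bonding
Z_STEPS = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "T": (-1, -1, 1),
}


def period3_power(seq):
    """Length-normalised power of the Z-curve step signals at frequency 1/3.

    Assumption: bases other than A, C, G, T (e.g. N) contribute zero.
    """
    if not seq:
        return 0.0
    seq = seq.upper()
    total = 0.0
    for axis in range(3):
        acc = 0j
        for n, base in enumerate(seq):
            step = Z_STEPS.get(base, (0, 0, 0))[axis]
            # Single DFT term at one cycle per three bases.
            acc += step * cmath.exp(-2j * math.pi * n / 3)
        total += abs(acc) ** 2
    return total / len(seq)


def sliding_period3(seq, window=351, step=3):
    """Return (start, power) pairs for each full window over seq.

    The default window and step are illustrative, not values from the source.
    """
    return [
        (start, period3_power(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy input: a codon-like repeat embedded in a flat background.
    toy = "A" * 90 + "ATG" * 30 + "A" * 90
    profile = sliding_period3(toy, window=30, step=3)
    start, power = max(profile, key=lambda item: item[1])
    print(start, round(power, 3))
```

Windows that overlap a region with codon-like periodicity should score higher than the background; how to threshold the profile, or how to smooth it, is left open in this sketch.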
Keywords/Search Tags: Similarity Analysis, Gene Identification, Base-pair Transform, Bilateral Similarity Function, Symbolic Dynamics