Font Size: a A A

Studies And Applications Of Some Biological Sequence Analysis Algorithms

Posted on:2011-03-08Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y Z ZhaoFull Text:PDF
GTID:1100360305966767Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the completion of the Human Genome Project, we get a large mount of biological data. Many mathematical problems arise from the analysis and processing of these biological data. Efficient algorithms are required to be designed to solve these problems.DNA and protein studies are two basic subjects in the study of molecular biolgy. We study some problems in the analysis of DNA and protein sequence. Haplotype is the sequence of some special DNA sites. Haplotype can be used to locate the probable genome associate with common diseases. Haplotype analysis plays an important role in the disease association study. Motif is a certain kind of fragment of DNA. Motif finding is critical to the study of the gene transcription and regulation of the gene expression. The function of a protein is determined by its structure. The protein structure can be predicted through its sequence, which can be further used to the virus detection and drug design. We study some problems about the haplotype, motif and protein structure and design some algorithms for them. The main contributions include:(1) Proposing a Haplotyping Algorithm in Population Data Acquiring haplotype data from biological experiments is usually very time consuming and expensive. Alternatively the haplotypes can be inferred from genotype data using computational methods, which is called haplotyping. Many haplotyping algorithms have been presented. Block partition-ligation strategy is widely adopted in these algorithms to improve the efficiency. In the traditional haplotype block partition-ligation strategy, the block partition is uniform. However, many studies show that the haplotype has certain block structure. We estimate the linkage disequilibrium among different SNPs. By employing dynamic programming and greedy algorithm, we propose a more reasonable haplotype block partition-ligation strategy to improve the accuracy of haplotyping.(2) Proposing a Haplotyping Algorithm in Pedigree Data Recently some new biological techniques are developed to generate a new type of biological data called xor-genotype. Compared with the traditional genotype data, xor-genotype is less informative but more economic. Some studies have been performed on the haplotyping in population based on the xor-genotype data. However the related study in pedigree is rare according to our knowledge. We transform this problem to a traditional problem in graph theory called graph realization. Compared with population data, pedigree structure provides more constraints, which leads to a high probability to gain a unique solution.(3) Proposing a Sequence Motif Finding Algorithm Motif plays an important role in the transcription factor binding and protein-protein interaction. Motif finding will be very helpful to our understanding of the genome function. Planted (l, d) motif finding problem is a traditional mathematical problem. Unfortunately it has been proved to be NP hard. Many algorithms have been proposed. Because of its NP-hardness, the running time of exact algorithms are usually very long. We proposed a more efficiency exact algorithm by employing hash table and pruning strategy.(4) Proposing a Secondary Structure Prediction Algorithm The determination of protein structure is critical to our understanding of the protein function. The traditional protein secondary structure prediction methods are usually based on homology sequence comparison. Through the NMR experiments, we can gain the chemical shift informetion of the atoms in each amino acid of a protein. Using the chemical shift information, we proposed a new protein secondary structure prediction algorithm. At first this algorithm use KNN method to predict the secondary structure. Then the BCJR algorithm is employed to smooth the prediction result.Classified by the research content, the innovations of this thesis are:(1) Haplotyping in Population According to the haplotype block structure in the human genome, we propose a more reasonable haplotype block partition-ligation algorithm. This algorithm can be used in the haplotyping problem in population to improve the accuracy of the haplotyping result.(2) Haplotyping in Pedigree We propose an algorithm with polynomial time complexity to solve the haplotyping problem in pedigree based on the xor-genotype data. Compared with the xor haplotyping in population, the pedigree structure provides more constrains to this problem, which increases the chance to gain a unique solution.(3) Sequence Motif Finding We present a new sequence motif finding algorithm. A perfect hash function is designed to map the solution space. Pruning strategy is also employed in our algorithm to reduce the average time complexity. Compared with the existed algorithms, our algorithm is much more efficient.(4) Protein Secondary Structure Prediction Using the chemical shift information and protein sequence, we employ KNN method to predict the protein secondary structure. BCJR algorithm is used to smooth the prediction result to further improve the accuracy. Compared with other algorithms, our algorithm gains better result.
Keywords/Search Tags:bioinformatics, single nucleotide polymorphism(SNP), haplotyping, motif finding, protein secondary structure prediction
PDF Full Text Request
Related items