Research On Haplotype Phasing Algorithm For Long Sequences

Posted on:2015-01-22

Degree:Master

Type:Thesis

Country:China

Candidate:W H Pan

Full Text:PDF

GTID:2250330428999747

Subject:Computer software and theory

Abstract/Summary:

Haplotypes are an important class of information in genetics. Because experimental methods are costly and slow, researchers' first choice is to infer haplotypes from inexpensive genotype data by computational methods. This computational problem, called haplotyping, is a fundamental problem in genomics. As genomics develops, longer and longer haplotype sequences are needed, some with more than one million sites. No existing method can process sequences with more than 100,000 SNPs on a personal computer, so the community needs algorithms that can phase extremely long sequences. This dissertation designs fast haplotype phasing algorithms for sequences with more than 100,000 SNPs that use very limited memory. The main research content and contributions are as follows.

(1) An Efficient Haplotype Phasing Algorithm for Extremely Long Sequence Datasets. WinHAP2.0 is a fast and precise large-scale haplotyping algorithm introduced by our research group in recent years. This dissertation improves WinHAP2.0 in two respects: it improves the segment-merging strategy, and it parallelizes WinHAP2.0. The experimental results show that the improved WinHAP2 reduces the segment-merging switch error rate by 20% to 30%. Compared with other methods, WinHAP2 achieves significant improvements in running speed and memory requirement, with better or comparable precision. WinHAP2 can phase 500 genotypes with 1,000,000 SNPs using just 12.8 MB of memory on a personal computer, whereas the other programs require unacceptable running times. The parallelized WinHAP2.0 achieves near-linear speedup.

(2) An Efficient Haplotype Phasing Algorithm for Large-scale Extremely Long Sequence Datasets. Because WinHAP1.0 and WinHAP2.0 run slowly when the number of sequences is very large, this work improves WinHAP by introducing clustering and presents a clustering-based haplotyping algorithm named CbWinHAP. CbWinHAP first clusters sequences by similarity and then phases each cluster separately. This strategy preserves precision while greatly reducing time and memory consumption. The experimental results show that CbWinHAP improves running speed and space efficiency by several orders of magnitude over WinHAP at the same accuracy. To further improve running speed, we parallelize CbWinHAP using the OpenMP programming model. The experimental results show that the parallelized CbWinHAP achieves near-linear speedup.
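The abstract reports accuracy as a switch error rate but does not define it. The sketch below uses the definition common in the phasing literature: among an individual's heterozygous sites, count how often the inferred haplotype changes which true haplotype it follows, and divide by the number of adjacent heterozygous-site pairs. The function name `switch_error_rate` and the 0/1 allele encoding are assumptions made for illustration, not taken from the thesis or from WinHAP.

```python
from typing import Sequence


def switch_error_rate(
    true_h1: Sequence[int],
    true_h2: Sequence[int],
    inferred_h1: Sequence[int],
) -> float:
    """Switch error rate of one inferred haplotype against the true pair.

    Assumes biallelic sites encoded as 0/1 and an inferred haplotype that
    is consistent with the genotype (hypothetical setup for illustration).
    """
    # Only heterozygous sites carry phase information.
    het = [i for i, (a, b) in enumerate(zip(true_h1, true_h2)) if a != b]
    if len(het) < 2:
        return 0.0  # no adjacent pair of heterozygous sites to compare

    # At each heterozygous site, does the inferred haplotype copy true_h1?
    follows_h1 = [inferred_h1[i] == true_h1[i] for i in het]

    # A switch is a change of the followed haplotype between neighbours.
    switches = sum(
        1 for prev, cur in zip(follows_h1, follows_h1[1:]) if prev != cur
    )
    return switches / (len(het) - 1)


if __name__ == "__main__":
    true_h1 = [0, 1, 0, 1, 1, 0]
    true_h2 = [1, 0, 0, 0, 1, 1]
    inferred = [0, 1, 0, 0, 1, 1]  # follows true_h1, then switches to true_h2
    print(switch_error_rate(true_h1, true_h2, inferred))
```

In this toy input there are four heterozygous sites and one switch, so the rate is one switch over three adjacent pairs. An inferred haplotype identical to either true haplotype gives a rate of zero, since the metric does not depend on which of the two haplotypes is labelled first.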

Keywords/Search Tags:

large-scale computing, SNP site, haplotype, genotype, haplotype phasing, extremely long sequences

Related items

1	Research On The Triploid Individual Haplotype Reconstruction Problem
2	Optimization Models And Algorithms Of The Haplotype And Genotype Problems
3	Large-scale Flow In Inclined Thermal Convection
4	Bioinformatics Analysis Of The Chimeric Sequences Generated In Multiple Displacement Amplification And Its Potential Use In Haplotype Assembling
5	Large Deviations And Strong Law Of Large Numbers For Some Dependent Random Sequences
6	Sequence Recognition And Hotspot Selective Preference Research Of Chimeras And Its Application Exploration In Haplotype Analysis
7	Systematic Biology Of The Order, Families, Genera Of The Umbilicariales Based On The Phenotype And Genotype And Their Significance For Systematic Biology In Genera
8	Statistical Methods For Haplotype Analysis With Genotyping Errors
9	Transcription Factor Binding Site Prediction Based On DNA Sequences
10	The Research On The Extremely Large Magneto-Resistance (XMR) In Non-magnetic Semimetals