Research On Haplotype Phasing Algorithm For Long Sequences

Posted on: 2015-01-22  Degree: Master  Type: Thesis
Country: China  Candidate: W H Pan  Full Text: PDF
GTID: 2250330428999747  Subject: Computer software and theory
Abstract/Summary:
Haplotypes are an important class of information in genetics. Because experimental methods for determining them are costly and slow, researchers prefer to infer haplotypes computationally from inexpensive genotype data. This computational problem, called haplotyping or haplotype phasing, is a fundamental problem in genomics. As genomics develops, longer and longer haplotype sequences are needed, some with more than one million sites. No existing method can process sequences with more than 100,000 SNPs on a personal computer, so the data volume challenges the community to develop algorithms that can phase extremely long sequences. This dissertation designs fast haplotype phasing algorithms for sequences with more than 100,000 SNPs that use very limited memory. The main research content and contributions are as follows.

(1) An Efficient Haplotype Phasing Algorithm for Extremely Long Sequence Datasets. WinHAP2.0 is a fast and precise large-scale haplotyping algorithm introduced by our research group in recent years. This dissertation improves WinHAP2.0 in two respects: it improves the segment-merging strategy, and it parallelizes WinHAP2.0. The experimental results show that the improved WinHAP2 reduces the segment-merging switch error rate by 20% to 30%. Compared with other methods, WinHAP2 achieves significant improvements in running speed and memory requirement, with better or comparable precision. WinHAP2 can phase 500 genotypes with 1,000,000 SNPs using just 12.8 MB of memory on a personal computer, whereas the other programs require unacceptable running times. The parallelized WinHAP2.0 achieves near-linear speedup.

(2) An Efficient Haplotype Phasing Algorithm for Large-scale Extremely Long Sequence Datasets. Because WinHAP1.0 and WinHAP2.0 run slowly when the number of sequences is very large, this work improves WinHAP by introducing clustering and presents a clustering-based haplotyping algorithm named CbWinHAP. CbWinHAP first clusters the sequences by similarity and then phases each cluster separately. This strategy preserves precision while greatly reducing time and memory consumption. The experimental results show that CbWinHAP improves running speed and space efficiency by several orders of magnitude compared with WinHAP, at the same accuracy. To further improve running speed, we parallelize CbWinHAP with the OpenMP programming model. The experimental results show that the parallelized CbWinHAP achieves near-linear speedup.
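The abstract evaluates phasing quality by switch error rate. As an illustration only, the following minimal Python sketch shows one common way to compute that metric for a single individual. It is not taken from WinHAP or CbWinHAP. The function names, the 0/1 allele coding, and the use of 2 for a heterozygous genotype are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Haplotype = Sequence[int]  # assumed coding: one 0/1 allele per SNP site


def genotype_of(h1: Haplotype, h2: Haplotype) -> List[int]:
    """Collapse a haplotype pair into an unphased genotype.

    Assumed coding: 0 or 1 for a homozygous site, 2 for a heterozygous site.
    """
    return [a if a == b else 2 for a, b in zip(h1, h2)]


def switch_error_rate(
    true_pair: Tuple[Haplotype, Haplotype],
    inferred_pair: Tuple[Haplotype, Haplotype],
) -> float:
    """Fraction of adjacent heterozygous site pairs whose relative phase
    differs between the true and the inferred haplotype pair."""
    t1, t2 = true_pair
    p1, p2 = inferred_pair
    if genotype_of(t1, t2) != genotype_of(p1, p2):
        raise ValueError("both pairs must resolve the same genotype")

    # Only heterozygous sites carry phase information.
    het_sites = [i for i, (a, b) in enumerate(zip(t1, t2)) if a != b]
    if len(het_sites) < 2:
        return 0.0

    # At each heterozygous site, record whether the first inferred
    # haplotype agrees with the first true haplotype.
    orientation = [p1[i] == t1[i] for i in het_sites]

    # A switch occurs whenever the orientation flips between neighbours.
    switches = sum(1 for a, b in zip(orientation, orientation[1:]) if a != b)
    return switches / (len(het_sites) - 1)


if __name__ == "__main__":
    truth = ([0, 1, 0, 1, 1], [1, 0, 0, 0, 0])
    guess = ([0, 1, 0, 0, 0], [1, 0, 0, 1, 1])
    print(switch_error_rate(truth, guess))
```

In this toy input the two pairs agree on the first two heterozygous sites and are swapped on the last two, so there is one switch among three adjacent heterozygous pairs. Swapping the two inferred haplotypes as a whole does not change the result, because only changes in orientation are counted.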
Keywords/Search Tags: large-scale computing, SNP site, haplotype, genotype, haplotype phasing, extremely long sequences