Pooled Sequencing: Experimental Design and Data Analysis

Posted on:2018-12-24

Degree:Doctor

Type:Dissertation

Country:China

Candidate:C C Cao


GTID:1310330515958285

Subject:Biomedical engineering

Abstract/Summary:


The concept of DNA sequencing dates back to the 1950s. The first sequencing method was presented for determining the nucleotide sequence of polyribonucleotides. Over the past decades, owing to rapid advances in sequencing technologies, the cost of DNA sequencing has fallen by several orders of magnitude. With commercially available next-generation and third-generation sequencing technologies, the cost of sequencing a human genome has been reduced to about $1000. Currently, DNA sequencing technology is heading toward higher throughput, longer reads and lower cost. Nevertheless, given the factory-scale effort involved, many research questions cannot be addressed by whole-genome sequencing of individuals, despite the plunging cost of sequencing. The main challenge lies in individually amplifying and constructing sequencing libraries for thousands of samples.

To use the capacity of the sequencer efficiently and reduce the cost of library construction in large-scale sequencing, multiple individuals can be pooled together and sequenced, an approach called pooled sequencing (pool-seq). The main limitation of the naive pool-seq strategy is its inability to recover information about each individual sample in the pool. Multiplexing with sequencing barcodes can overcome this drawback. However, because the reads produced by current high-throughput sequencing technology are short, barcodes must be very short, so the number of samples that barcodes can encode is often limited. Besides, barcoding every sample remains costly at present. In 2009, Patterson et al. put forward a new kind of pool-seq strategy, termed combinatorial pooled sequencing. In combinatorial pooled sequencing, samples are mixed into a small number of pools according to a carefully designed pooling strategy, in which pooling patterns rather than DNA barcodes are used to tag samples. Combinatorial pooled sequencing allows the sequencing results to be decoded to
identify the reads that belong to each sample in the population. Hence, combinatorial pooled sequencing involves two more steps than normal pool-seq: encoding and decoding. The encoding step is the design of the pooling strategy, which should guarantee a distinct pooling pattern for each sample. The decoding step analyzes the pooled sequencing results to obtain the reads that belong to each sample according to its unique pooling pattern.

In this thesis, we mainly focused on the experimental design and data analysis for pool-seq, especially for combinatorial pooled sequencing. We first optimized combinatorial pooled sequencing and applied it to screen rare variant carriers, identify rare haplotype carriers and reconstruct single individual haplotypes. Taking advantage of dual mononucleotide addition based pyrosequencing, we also developed a method to efficiently identify de novo SNPs in pooled samples and conducted a real pool-seq experiment to validate the feasibility of this method. The work contained in this thesis is as follows:

1. We proposed optimized combinatorial pooled sequencing designs for screening rare variant carriers. We first formulated a model to compute the optimal depth for observing variants sufficiently often in pooled sequencing, as well as a cost model for combinatorial pooled sequencing. Using pooling designs from the field of group testing, appropriate parameters for combinatorial pooled sequencing could be selected to minimize cost while guaranteeing accuracy. Because of the mixing constraint and the high depth required for pooled sequencing, the results showed that it was more cost-effective to divide a large population into smaller blocks, each tested independently using an optimized strategy. With the optimized combinatorial pooled sequencing, the cost of screening carriers of variants with a frequency of 1% in 200 diploid individuals dropped to 52% when the target sequencing region was set to 30 Mb. To further
use the quantitative information contained in the pooled sequencing results, we proposed a combinatorial pooled sequencing strategy based on quantitative group testing that allows the efficient recovery of variant carriers among numerous individuals at much lower cost. We used random k-set pool designs to mix samples and optimized the design parameters according to an indicative probability. Subsequently, a heuristic Bayesian probability decoding algorithm was designed to identify variant carriers. Finally, we conducted in silico experiments to find variant carriers among 200 simulated Escherichia coli strains. With the simulated pools and publicly available Illumina sequencing data, our method correctly identified the variant carriers for 91.5-97.9% of variants with frequencies ranging from 0.5 to 1.5%. Comparisons with compressed sequencing and with combinatorial pooled sequencing strategies based on qualitative group testing revealed that this strategy performed better, especially in reducing the required data throughput and cost.

2. We put forward a method to estimate haplotype frequencies from pooled samples and applied it to screen rare haplotype carriers. Taking advantage of databases that contain prior haplotypes, we presented Ehapp to estimate the frequencies of haplotypes from pooled sequencing data. Ehapp first recast haplotype frequency estimation as finding a sparse solution to a system of linear equations and then used a sparse signal reconstruction algorithm from the field of compressed sensing to solve the equations. Simulation results showed that Ehapp could estimate the frequencies of haplotypes with only about 3% average relative difference for pooled sequencing of a mixture of 10 haplotypes with a total coverage of 50×. When unknown haplotypes existed, Ehapp maintained excellent performance for haplotypes with actual frequencies higher than 0.05. Comparisons with the existing method on simulated data in conjunction with publicly available Illumina sequencing
data indicated that Ehapp is state of the art for many sequencing study designs. Using Ehapp, we also demonstrated the feasibility of applying combinatorial pooled sequencing to identify rare haplotype carriers cost-effectively. Building on Ehapp, we further presented Ehapp2 to infer haplotype frequencies from pooled sequencing data, in which local haplotypes spanning a genome region of fixed length, rather than SNPs, are taken as units. Furthermore, an Expectation-Maximization algorithm was employed to calculate the proportions of local haplotypes, which can use sequencing quality to reduce the effect of sequencing errors. Simulation experiments revealed that Ehapp2 was robust to sequencing errors and able to estimate the frequencies of haplotypes with less than 3% average relative difference for pooled sequencing of a mixture of real Drosophila haplotypes with 50× total coverage, even when the sequencing error rate was as high as 0.05. Because the proportions of local haplotypes spanning multiple SNPs were accurately calculated first, Ehapp2 retained excellent estimation for recombinant haplotypes resulting from chromosomal crossover. Comparisons with existing methods revealed that Ehapp2 performed better and was more suitable for current massively parallel sequencing. Ehapp and Ehapp2 for Linux platforms are available at http://bioinfo.seu.edu.cn/Ehapp and http://bioinfo.seu.edu.cn/Ehapp2.

3. We proposed a clone-based haplotyping method using combinatorial pooled sequencing. Given a clone library for an individual, clones were pooled according to a random size-k design. Using the distinct pooling pattern of each clone in the overlapping pool sequencing, alleles of the recovered variants could be assigned precisely to their original clones. Subsequently, the clone sequences could be reconstructed by linking these alleles accordingly. Finally, HapCUT was employed to assemble haplotypes by linking the reconstructed clones. To verify the utility of our
method, we conducted an in silico experiment to assemble the haplotype sequence of chromosome 1 of the individual NA12878. Ultimately, 112 haplotype contigs were assembled with an N50 length of 3.4 Mb and no switch errors. Comparisons with current clone-based haplotyping methods indicated that our strategy was more accurate. To make our method easier to use, OPShap (in Perl) for encoding and decoding, with detailed instructions, is available online at http://bioinfo.seu.edu.cn/OPShap.

4. Taking advantage of dual mononucleotide addition based pyrosequencing, we presented Epds, a method to efficiently identify SNPs from pooled DNA samples. Because only five patterns of difference exist between the pyrogram profiles of wild-type and mutant sequences when using dual mononucleotide addition based pyrosequencing, we employed an enumerative algorithm to infer the mutant locus and estimate the proportion of the mutant sequence. From the profiles resulting from three runs with distinct dual mononucleotide additions, Epds could recover the wild-type and mutant bases. Extensive simulations revealed that our method achieved an accuracy greater than 89% for identifying mutants with proportions higher than 0.02 in the pooled samples when the coefficient of variation of the pyrosequencing signals was fixed at 0.0016. The results also showed that our method had a false positive rate of only about 3%. Comparison with the current method revealed that Epds performed better. Finally, experiments based on profiles produced by real sequencing showed that our method could be successfully applied to identify mutants from pooled samples. The software implementing the method, Epds, is open source and available at http://bioinfo.seu.edu.cn/Epds.
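
The encoding and decoding steps of combinatorial pooled sequencing described above can be illustrated with a toy example. The Python sketch below is not the thesis's optimized design or its Bayesian decoder: it assumes a plain random k-set design, error-free variant detection in every pool, and simple qualitative decoding. The function names, the pool count (16), the pattern size k (4) and the carrier index (42) are all hypothetical choices made for illustration.

```python
import itertools
import random


def random_k_set_design(n_samples, n_pools, k, seed=0):
    """Encoding: give every sample a distinct random set of k pools."""
    all_patterns = list(itertools.combinations(range(n_pools), k))
    if n_samples > len(all_patterns):
        raise ValueError("not enough distinct k-subsets for this many samples")
    rng = random.Random(seed)
    # Sampling without replacement guarantees distinct pooling patterns.
    return [frozenset(p) for p in rng.sample(all_patterns, n_samples)]


def positive_pools(design, carriers):
    """Pools in which the variant is seen (assumes error-free detection)."""
    pools = set()
    for sample in carriers:
        pools |= design[sample]
    return pools


def decode(design, observed_pools):
    """Decoding: keep samples whose whole pattern lies in the positive pools."""
    return [s for s, pattern in enumerate(design) if pattern <= observed_pools]


# 200 samples tagged by 4-of-16 pooling patterns instead of 200 barcodes.
design = random_k_set_design(n_samples=200, n_pools=16, k=4, seed=1)
observed = positive_pools(design, carriers=[42])
candidates = decode(design, observed)
print(candidates)
```

With a single carrier, the positive pools coincide with that sample's pattern, so the decoder returns exactly that sample. With several carriers, the union of their patterns may also cover the patterns of non-carriers, which is one reason a real design has to be optimized and the decoder made more elaborate.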

Keywords/Search Tags:

Pooled sequencing, Group testing, Combinatorial pooled sequencing, Rare variant, Rare haplotype, Single individual haplotyping, SNP


Related items

1	Statistical Methods For Testing Gene-Environment Interactions With Rare Variants
2	Genetic Association Analysis Of Rare Variants
3	Nonparamatric Tests Of Association With Rare Variant Based On Haplotype
4	Rare Cell Detection Methods Based On Single-cell Transcriptome Sequencing Data
5	Intensive Detection Of Genomic Variants
6	Research Of Haplotyping Method Based On Dilution And Overlapping Pool Sequencing
7	Genomic Structural Variant Prediction Algorithm And Software
8	Data Analysis And Decoding Algorithm For The Real Time Pyrosequencing Based On Cyclical Dual Mononucleotide Addition
9	Design And Analysis Of A Viral Quasispecies Haplotype Reconstruction Optimization Algorithm
10	Gene Identification Via Phenotype Sequencing