
Some Mathematical Questions About De Novo SNP Calling

Posted on: 2013-11-06
Degree: Master
Type: Thesis
Country: China
Candidate: J Z Dou
Full Text: PDF
GTID: 2230330377952407
Subject: Operational Research and Cybernetics
Abstract/Summary:
Next-generation sequencing (NGS) technologies have revolutionized genomic and transcriptomic approaches to biology, leading to a tremendous increase in the amount of available sequence data. Single nucleotide polymorphisms (SNPs) are the most abundant type of genetic variation in eukaryotic genomes and are considered the marker of choice in a wide variety of applications, such as GWAS, genetic mapping, QTL analysis, phylogenomics, and population genetic studies. Recently, several methods, such as RAD, GBS, and RRLs, have been developed for high-throughput de novo SNP discovery and genotyping on NGS platforms. Most of them use restriction enzymes for genome complexity reduction (GCR) to reduce the total sequencing cost.

Most existing SNP calling algorithms depend on reference-based read mapping, which limits their use in non-model species, for which a reference genome is usually not available because of the short read length (30-100 bp). Compared with reference-based methods, de novo SNP calling faces three difficulties. The first is how to exclude SNPs that are actually derived from repetitive regions; the second is the relation between average read coverage and SNP detection probability; the last is dealing with sequencing errors, which can masquerade as SNPs.

In this paper, some mathematical questions about de novo SNP calling based on GCR methods are discussed further, for example the effect of the minimum number of reads required for each allele and the effect of genome complexity. According to the simulations, filtering for clusters whose substitution differences are supported by at least two reads for each allele is an effective way to exclude sequencing errors from SNP calling. In addition, a reasonable sequencing coverage is 15-20X, with FPS lower than 2% and FNS lower than 5%.

Most eukaryotic genomes contain a remarkable portion of sequences that are repetitive or close to repetitive on the length scale of a short read. False SNPs arise and are miscalled from read clusters in which reads carrying different sequence variants are actually derived from distinct genomic locations (i.e. repetitive regions). In general, such composite read clusters should have greater cluster depth than normal (i.e. non-composite) ones, which can be used to identify composite clusters and exclude them from SNP calling. Here we demonstrate that the accuracy of de novo SNP calling can be remarkably improved using a modified ML algorithm (hereafter called MP-ML), which incorporates a mixed Poisson model to identify composite clusters and therefore prevents wrong SNP calls resulting from repetitive genomic regions. The MP-ML algorithm is especially powerful in cases of very short read length and high genome complexity. Finally, we created two series of RAD simulation datasets to demonstrate the power of the MP-ML algorithm.
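The two quantitative ideas in the abstract, a minimum number of reads per allele and depth-based identification of composite clusters, can be illustrated with a minimal sketch. It is not the MP-ML algorithm itself. It assumes, purely for illustration, that read counts are Poisson distributed, that the two alleles of a heterozygous site each receive half of the average coverage, and that a composite cluster merges exactly two loci and therefore has twice the expected depth. The function names and the parameters `coverage`, `min_reads`, `lam` and `weight_composite` are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), evaluated in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def het_detection_prob(coverage, min_reads=2):
    """Probability that BOTH alleles of a heterozygous site are supported
    by at least `min_reads` reads.

    Assumption: each allele's read count is an independent
    Poisson(coverage / 2) variable.
    """
    lam = coverage / 2.0
    p_below = sum(poisson_pmf(i, lam) for i in range(min_reads))
    return (1.0 - p_below) ** 2


def composite_posterior(depth, lam, weight_composite):
    """Posterior probability that a cluster with the given depth is composite.

    Assumption: a two-component Poisson mixture, where normal clusters have
    depth ~ Poisson(lam) and composite clusters (two merged loci) have
    depth ~ Poisson(2 * lam); `weight_composite` is the prior fraction of
    composite clusters.
    """
    p_normal = (1.0 - weight_composite) * poisson_pmf(depth, lam)
    p_composite = weight_composite * poisson_pmf(depth, 2.0 * lam)
    return p_composite / (p_normal + p_composite)


if __name__ == "__main__":
    for cov in (5, 10, 15, 20):
        print(cov, round(het_detection_prob(cov, min_reads=2), 4))
    for depth in (20, 30, 40):
        print(depth, round(composite_posterior(depth, lam=20.0, weight_composite=0.1), 4))
```

Under these assumptions the detection probability rises with coverage, and clusters whose depth is close to twice the expected value receive a high composite posterior. A fuller treatment would estimate the mixture parameters from the observed depth distribution rather than fixing them by hand.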
Keywords/Search Tags: NGS, SNP calling, MP-ML algorithm