
The Pipeline For SNP Calling Based On Maize High-throughput Sequencing Data And Its Application

Posted on: 2016-09-11
Degree: Master
Type: Thesis
Country: China
Candidate: T Li
Full Text: PDF
GTID: 2323330512972292
Subject: Crops
Abstract/Summary:
Single nucleotide polymorphisms (SNPs) are widely used in marker-assisted selection, hybrid seed purity testing, QTL/gene mapping, and genetic linkage mapping, especially genome-wide association mapping. As third-generation molecular markers, SNPs have several advantages, including genome-wide distribution, genetic stability, and fast, easy detection. After long natural selection and artificial domestication, maize has become a staple crop with rich genetic variation. However, the large number of repeat sequences and transposons in its genome makes variant detection in maize difficult.

In the present research, empirical base-calling and GC%-depth profiles trained on real maize re-sequencing data were used to simulate high-throughput sequencing data of varying read length and coverage, and four SNP calling programs (SAMtools, GATK-UnifiedGenotyper, VarScan and FreeBayes) were used to identify SNPs. The results were evaluated by SNP calling rate, operating efficiency and false positive rate, from which the best pipeline for calling maize SNPs and the best combination of coverage and read length for maize genomic re-sequencing were determined.

First, we used the pIRS program to construct the characteristic profile of maize based on the reference genome sequence and then, using Illumina sequencing data, simulated data sets of different coverage and read length, on which the four SNP calling programs were run. The results showed that SAMtools, VarScan and GATK-UnifiedGenotyper were more accurate, while FreeBayes gave a higher false positive rate. When coverage was below 8-fold, SAMtools and GATK-UnifiedGenotyper had similar accuracy, but SAMtools had 15% higher calling efficiency than GATK-UnifiedGenotyper. When coverage was above 8-fold, FreeBayes, SAMtools and GATK-UnifiedGenotyper had about the same SNP calling accuracy and efficiency, while VarScan did not reach this level. When coverage exceeded 30-fold, the four programs had almost the same SNP calling accuracy and efficiency. Moreover, the SNP loci detected simultaneously by the programs had an accuracy as high as 99.98%. Therefore, SAMtools is recommended for data below 8-fold coverage, and GATK-UnifiedGenotyper performs best above 8-fold. The high accuracy of the jointly detected loci also suggests that calling with multiple programs is more accurate than calling with a single program.

Second, the candidate optimal designs for maize were: 100 bp paired-end reads at 15-fold coverage with GATK-UnifiedGenotyper (SNP calling rate 85.9%, accuracy 99.84%); 150 bp paired-end reads at 8-fold coverage (SNP calling rate 86.4%, accuracy 99.92%); and 250 bp paired-end reads at 5-fold coverage with SAMtools (SNP calling rate 81.6%, accuracy 99.82%). Paired-end 150 bp reads at 8-fold coverage are therefore the best combination of coverage and read length for maize genomic re-sequencing.

Finally, we used GATK-UnifiedGenotyper and VarScan to call SNPs from paired-end sequencing data of H99 at 11-fold coverage, which yielded 6,885,936 and 4,878,937 SNP loci, respectively. Then 38 SNPs called by GATK-UnifiedGenotyper but not by VarScan were randomly chosen for validation by sequencing, and 36 of them were confirmed. This result demonstrates that the optimal pipeline (GATK-UnifiedGenotyper) calls SNPs in maize with high efficiency and accuracy. The best pipelines were integrated into a Perl application program, to which some other functions were added. In general, the present research can serve as a valuable reference for SNP calling studies in maize and other species with respect to the choice of pipeline, sequencing coverage and read length.
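
The evaluation described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's Perl program: the abstract does not define its metrics, so here "calling rate" is assumed to mean the fraction of simulated true SNPs that a program recovers, and "accuracy" the fraction of called SNPs that are true. The function names and the toy sites are hypothetical.

```python
from typing import Set, Tuple

# A SNP site as (chromosome, 1-based position, alternate allele).
# This representation is an assumption made for the sketch.
Site = Tuple[str, int, str]


def calling_rate(called: Set[Site], truth: Set[Site]) -> float:
    """Assumed definition: share of true (simulated) SNPs that were called."""
    return len(called & truth) / len(truth) if truth else 0.0


def accuracy(called: Set[Site], truth: Set[Site]) -> float:
    """Assumed definition: share of called SNPs that are true."""
    return len(called & truth) / len(called) if called else 0.0


def consensus(*call_sets: Set[Site]) -> Set[Site]:
    """Sites reported by every program (multiple-program calling)."""
    return set.intersection(*call_sets) if call_sets else set()


# Toy data: four simulated SNPs and two hypothetical callers,
# each recovering three true sites and adding one false positive.
truth = {("chr1", 100, "A"), ("chr1", 250, "T"), ("chr2", 75, "G"), ("chr2", 900, "C")}
caller_a = {("chr1", 100, "A"), ("chr1", 250, "T"), ("chr2", 75, "G"), ("chr1", 400, "C")}
caller_b = {("chr1", 100, "A"), ("chr1", 250, "T"), ("chr2", 900, "C"), ("chr2", 10, "G")}

both = consensus(caller_a, caller_b)
for name, calls in [("caller_a", caller_a), ("caller_b", caller_b), ("consensus", both)]:
    print(name, calling_rate(calls, truth), accuracy(calls, truth))
```

In this toy case the consensus set drops both false positives, so its accuracy rises while its calling rate falls, which is the trade-off behind the multiple-program observation above. By the same arithmetic, the 36 confirmed SNPs out of 38 tested correspond to about 94.7%.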
Keywords/Search Tags: high-throughput sequencing, SNP, maize, detect variant, pipeline, application