Fast Error Correction Method Of NGS Data Based On K-spectrum Algorithm

Posted on: 2019-05-11; Degree: Master; Type: Thesis
Country: China; Candidate: Q Liu; Full Text: PDF
GTID: 2370330545950681; Subject: Software engineering
Abstract/Summary:
In recent years, Next Generation Sequencing (NGS) technology has been widely used, and the amount of data generated by biological institutions and laboratories around the world has increased rapidly. However, NGS data still faces a prominent problem: its error rate is high, which leads to inaccurate downstream analysis. The K-spectrum algorithm is a common method for error correction. It divides the data into smaller fragments, detects which fragments are erroneous according to their frequency, and then uses an error correction module to correct them. Because of their limitations, existing error correction methods based on the K-spectrum algorithm cannot meet the demand for accurate, rapid, and low-cost processing of the large amount of existing sequencing data. A more efficient and effective error correction method is therefore needed.

This paper introduces the principle of the K-spectrum algorithm, analyzes the limitations of existing error correction tools based on it, and identifies their respective problems and causes. The Musket tool employs a multi-stage approach to error correction; because it uses reference fragments for correction, it tends to introduce new incorrect bases. The Blue and Reptile tools both use the context information of the erroneous base position to replace the erroneous sequence segment; their error correction efficiency is high, but their accuracy is not.

To address the problems of the existing tools, we add a data preprocessing stage that filters Illumina-generated sequencing data according to its quality scores, reducing low-quality sequences and problems such as base deletion and character interference in the data files. Then, following classical practice, the data is split into k-mers, and the k-mers are divided into a trusted set and an error set. Afterwards, an improved De Bruijn graph is constructed to transform the error correction problem into one of graph matching and retrieval. To improve the accuracy of error correction, the A* algorithm and the Needleman-Wunsch scoring method are introduced to solve the path search and correction problem for erroneous segments in a sequence. In addition, to improve error correction efficiency, hashing and concurrent queues are widely used in the program to solve the storage and retrieval problems of large data sets.

Based on the above considerations, this paper designs ASEC (A Star of Error Correction), a K-spectrum-based error correction method for NGS data, and distributes the data on the Spark distributed cloud computing platform to improve the speed of error correction. Comparison experiments with other error correction tools show that the ASEC method outperforms currently popular error correction tools: it has higher error correction efficiency without compromising the accuracy of error correction. Running on the Spark platform also shows that the algorithm has good distributed processing capability, and the error correction runtime is greatly reduced. However, because of limited time, the current phase uses a fixed coverage length, which imposes certain limitations on the algorithm and leaves room for further optimization.
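The k-mer step described in the abstract (splitting reads into k-mers and dividing them by frequency into a trusted set and an error set) can be illustrated with a minimal Python sketch. This is not the ASEC implementation: the function names, the toy reads, and the fixed `min_count` threshold are all assumptions made for illustration, and the thesis does not state how its threshold is chosen.

```python
from collections import Counter


def kmers(read, k):
    """Yield every overlapping substring of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def k_spectrum(reads, k):
    """Count how often each k-mer occurs across all reads."""
    return Counter(km for read in reads for km in kmers(read, k))


def split_spectrum(spectrum, min_count):
    """Split k-mers into a trusted set (count >= min_count) and an error set.

    min_count is a hypothetical fixed threshold chosen for this example.
    """
    trusted = {km for km, count in spectrum.items() if count >= min_count}
    errors = set(spectrum) - trusted
    return trusted, errors


# Toy data: three identical reads plus one read with a single substitution.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]
spectrum = k_spectrum(reads, k=4)
trusted, errors = split_spectrum(spectrum, min_count=3)
```

In this toy input, the k-mers that overlap the substituted base occur only once and therefore fall into the error set, while k-mers shared by the correct reads remain trusted. Real tools typically derive the threshold from the observed k-mer frequency distribution rather than fixing it by hand.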
Keywords/Search Tags:NGS, K-spectrum, Correction Method, Spark