
High-performance Biological Sequence Processing Framework For NGS Data

Posted on: 2022-10-22 | Degree: Master | Type: Thesis
Country: China | Candidate: H L Song | Full Text: PDF
GTID: 2480306314474134 | Subject: Software engineering
Abstract/Summary:
Sequencing technologies, especially next-generation sequencing (NGS), have been broadly used in clinical applications in recent years, and modern sequencing technologies continue to revolutionize many areas of biology and medicine. The last decade has witnessed an explosion in the amount of available biological sequence data, driven by the rapid progress of high-throughput sequencing projects. Because sequencing speed has improved dramatically while its economic cost has fallen significantly, sequencing centers generate a massive amount of sequence data every day. This data volume challenges both the supporting hardware and the computational scientists who must process the data efficiently and effectively. Over the past few years, the performance of a single processor has increased only slowly, and the improvement of single-core performance has almost stagnated. Traditional data analysis platforms and methods can no longer meet the need to process data analysis tasks rapidly in the life sciences. How to handle high-throughput sequencing data on new architectures is therefore an urgent problem.

Most tools parse sequencing data with a single thread, which is fast enough to reach the peak throughput of a traditional mechanical hard disk. With the development of storage technology, however, the efficiency of this parsing step can no longer meet performance requirements, especially for I/O-bound tools. We developed FastIO, a fast and efficient parallel sequence processing framework, to offset the slow development of I/O processing relative to the CPU. The main research results of this thesis are as follows:

1) We investigate several bioinformatics analysis tools and propose FastIO, a high-performance biological sequence data processing framework for multi-core platforms. FastIO supports the widely used FASTA and FASTQ formats and provides detailed development guidelines for researchers.

2) We design four case studies in which FastIO replaces the original I/O module of an existing tool: Ktrim, fastp, Mash Screen, and fastv2. These case studies cover different bioinformatics topics and the processing of sequencing data in both FASTQ and FASTA formats. We evaluate the performance and scalability of FastIO by comparing the original version of each tool with a version modified to use FastIO.
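The abstract does not describe FastIO's internal design, so the following is only a generic illustration of the idea it motivates: separating file reading from record processing so that several threads can work on FASTQ data at once. It is a minimal sketch under stated assumptions, not FastIO's API. The function names `read_fastq_batches` and `count_bases_parallel` are hypothetical, the input is assumed to use strict four-line FASTQ records, and base counting stands in for real per-read work.

```python
import io
import queue
import threading


def read_fastq_batches(handle, batch_size):
    """Yield lists of (name, sequence, quality) tuples from a FASTQ stream.

    Assumes strict four-line records (no wrapped sequences, no blank lines).
    """
    batch = []
    while True:
        header = handle.readline()
        if not header:
            break  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        batch.append((header.rstrip("\n")[1:], seq, qual))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


def count_bases_parallel(handle, n_workers=2, batch_size=2):
    """One producer thread reads batches; n_workers threads process them."""
    batches = queue.Queue(maxsize=8)  # bounded so the reader cannot run far ahead
    totals = []
    lock = threading.Lock()

    def producer():
        for batch in read_fastq_batches(handle, batch_size):
            batches.put(batch)
        for _ in range(n_workers):
            batches.put(None)  # one stop sentinel per worker

    def consumer():
        local = 0
        while True:
            batch = batches.get()
            if batch is None:
                break
            # Placeholder workload: count the bases in each read.
            local += sum(len(seq) for _, seq, _ in batch)
        with lock:
            totals.append(local)

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(totals)


if __name__ == "__main__":
    data = "@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nIII\n@r3\nAC\n+\nII\n"
    print(count_bases_parallel(io.StringIO(data)))
```

The bounded queue keeps memory use predictable when the reader is faster than the workers. In CPython, threads running pure-Python code do not execute in parallel, so this sketch shows the structure of the pattern rather than its speedup; a performance-oriented implementation would typically be written in a compiled language.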
Keywords/Search Tags:Bioinformatics, NGS, High Performance Computing, IO Framework