
Research On VCF Format Genomic Data Compression And Parallelization Based On Domestic Bigdata All-in-one Machine

Posted on: 2022-02-10  Degree: Master  Type: Thesis
Country: China  Candidate: S Chen  Full Text: PDF
GTID: 2480306557468714  Subject: Software engineering
Abstract/Summary:
The Internet of Everything is driving exponential growth in data, and the bigdata all-in-one machine is expected to be one of the effective means of processing big data in the future. At present, domestic bigdata all-in-one machines still have shortcomings such as insufficient security, poor performance, and few applications. Genome sequencing research is widely used in many important fields such as biology, medicine, and genetics. For example, during the COVID-19 pandemic, studying the genome sequence of the virus contributed greatly to human understanding of it. Over the past two decades, advances in genome sequencing technology have increased the volume of genome data by several orders of magnitude, and the storage and transmission of sequencing data have become a bottleneck in the field. At present, VCF files are mainly stored with gzip, whose compression ratio and compression speed are both low. Therefore, this dissertation proposes an efficient genome data compression algorithm based on the domestic bigdata all-in-one machine. The main work is as follows:

(1) To address the lack of genome sequencing applications on domestic bigdata all-in-one machines, a compression algorithm for VCF-format genome sequencing files, vcfzip, is proposed. The algorithm achieves lossless compression with a high compression ratio through block division, information extraction, coding, and exception handling. The block design lets users tune compression performance to the capabilities of the machine and provides favorable conditions for parallelization. Information extraction classifies the different types of data in the VCF file and extracts them into memory, which simplifies and speeds up encoding while preserving the integrity of the data. Coding is divided into large number coding, base coding, and GT coding, each intended to reduce the storage footprint of data with different characteristics. Exception handling keeps dirty data intact during the information extraction stage and so ensures lossless compression.

(2) To address the poor performance of the domestic bigdata all-in-one machine, this dissertation applies the genome compression algorithm vcfzip to the distributed cluster of the domestic bigdata all-in-one machine and proposes a parallel algorithm, vcfzip?hadoop. It modifies the DAG of the vcfzip algorithm to give it parallel capabilities and improves parallel performance by optimizing slicing and partitioning. Experiments show that distributed parallelization greatly improves the execution speed of the algorithm and that compression speed increases as nodes are added to the cluster.

(3) To make the two algorithms above easy to use on the domestic bigdata all-in-one machine, this dissertation designs a web system on that machine for compressing VCF genome data. In the system, users can quickly compress genome data in batches through a simple interface and can also monitor the running status of the system in real time.
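The abstract does not give vcfzip's actual encodings, so the following is only a minimal sketch of how the steps named in (1) could fit together: block division, information extraction, base coding, and exception handling. It is not the thesis implementation. Every function name, the four extracted columns, and the 2-bit-per-base packing are assumptions chosen for illustration.

```python
from typing import Dict, Iterator, List, Tuple, Union

# Assumed 2-bit code for the four standard bases (not taken from the thesis).
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASES = "ACGT"

Token = Union[Tuple[str, str], Tuple[str, int, bytes]]


def split_blocks(records: List[str], block_size: int) -> Iterator[List[str]]:
    """Block division: yield groups of at most block_size VCF data lines."""
    for start in range(0, len(records), block_size):
        yield records[start:start + block_size]


def extract_columns(block: List[str]) -> Dict[str, List[str]]:
    """Information extraction: group tab-separated fields by column."""
    columns: Dict[str, List[str]] = {"CHROM": [], "POS": [], "REF": [], "ALT": []}
    for line in block:
        fields = line.rstrip("\n").split("\t")
        columns["CHROM"].append(fields[0])
        columns["POS"].append(fields[1])
        columns["REF"].append(fields[3])
        columns["ALT"].append(fields[4])
    return columns


def encode_bases(seq: str) -> Token:
    """Base coding with an exception path.

    Pure A/C/G/T strings are packed at 2 bits per base. Anything else
    (for example "N" or "<DEL>") is kept verbatim so no data is lost.
    """
    if any(base not in BASE_CODES for base in seq):
        return ("raw", seq)
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        packed[i // 4] |= BASE_CODES[base] << (2 * (i % 4))
    return ("packed", len(seq), bytes(packed))


def decode_bases(token: Token) -> str:
    """Invert encode_bases for both the packed and the raw case."""
    if token[0] == "raw":
        return token[1]
    _, length, packed = token
    return "".join(
        CODE_BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )
```

Because each block is encoded without reference to any other block, blocks like these could in principle be handed to separate workers, which is the property the abstract credits for making parallelization easier.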
Keywords/Search Tags:Bigdata All-in-one Machine, Genome Compression, VCF, Parallelization, Hadoop