
Massive Data Many Task Parallel Data Framework For GWAS

Posted on: 2018-02-25
Degree: Master
Type: Thesis
Country: China
Candidate: P H Liu
Full Text: PDF
GTID: 2370330623450961
Subject: Computer Science and Technology
Abstract/Summary:
The widespread use of high-throughput sequencing has led to an explosion of genetic data, and genomic analysis now faces the problem of how to organize and access data at the petabyte (PB) and even zettabyte (ZB) scale. Million-person genome sequencing projects are already under way, and genome-wide association studies (GWAS) covering millions of people are expected to yield breakthroughs in the study of common diseases and complex traits. Analysis of genomic big data cannot be allowed to stall because of the storage wall. At present, parallel access to petabytes of genetic data has serious performance bottlenecks: a mismatch between how the data is stored and how it is accessed, metadata congestion caused by reading and writing massive numbers of files, and pressure on the network and memory bandwidth of existing high-performance computing hardware.

This thesis presents a massive-data parallel framework for genome-wide association analysis. It is built on the HDF5 data format, which has been proven over many years in high-performance environments, and it supports many-task parallelism for association analysis through data sharding, data transposition, data filtering, high-throughput encoding and compression, and boundary alignment. Combined with the aggregation cache optimization of the Tianhe-2 hardware, the framework resolves data locality conflicts, improves the contiguity of data access, and reduces pressure on the metadata server. Exploiting the characteristics of association analysis, data filtering, type conversion, and columnar storage compression further reduce overhead. Overall, data access performance improves by more than a factor of 10, and the framework also scales well.

Data sharding and data transposition follow the approach of VariantDB: annotations are established for the data domain, and a database is then used to query and filter on those annotations. The flexibility of the HDF5 format makes data reorganization easy. After reorganization, each computing job can efficiently access exactly the data related to it. Data transposition stores the data needed by the same computing job contiguously, which improves access contiguity, reduces the frequency of small metadata accesses, and relieves pressure on the Lustre metadata server.

Data filtering and high-throughput encoding and compression exploit the characteristics of genetic data analysis tasks: data unrelated to the computing jobs is removed in the preprocessing stage, which increases data density and reduces network and memory bandwidth pressure. The results show that data filtering alone achieves a 51.8-fold data reduction, and that filtering combined with encoding and compression achieves a 579-fold reduction.

On Tianhe-2, the Lustre file system is tuned through HDF5, using HDF5's chunking (blocking), alignment, and caching together with the Lustre server's metadata caching and aggregation caching. This achieves parallel I/O while reducing the number of compute nodes that interact with the I/O servers, avoids unnecessary congestion, improves metadata access performance, and increases data access speed by a factor of 10.
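The preprocessing steps named in the abstract (data transposition, data filtering, and compact encoding) can be illustrated with a small sketch. This is not the thesis's implementation, and the abstract does not specify its encoding or filter criteria. The sketch assumes genotypes coded as 0, 1, 2 with 3 meaning "missing", a hypothetical filter that drops variants showing no variation among called genotypes, and a 2-bit packing of four genotypes per byte. All function names are illustrative, and the storage layer (HDF5 on Lustre) is left out so that the example runs with the standard library alone.

```python
from typing import List, Sequence

MISSING = 3  # assumed code for a missing genotype call


def transpose(sample_major: Sequence[Sequence[int]]) -> List[List[int]]:
    """Turn one-row-per-sample into one-row-per-variant.

    After this step, all genotypes of a single variant sit next to
    each other, so a per-variant job reads one contiguous row.
    """
    return [list(column) for column in zip(*sample_major)]


def is_informative(genotypes: Sequence[int]) -> bool:
    """Hypothetical filter: keep a variant only if its called
    (non-missing) genotypes take more than one distinct value."""
    called = {g for g in genotypes if g != MISSING}
    return len(called) > 1


def pack_2bit(genotypes: Sequence[int]) -> bytes:
    """Pack genotype codes 0..3 into bytes, four per byte,
    with the first genotype in the lowest two bits."""
    out = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if not 0 <= g <= 3:
            raise ValueError(f"genotype code out of range: {g}")
        out[i // 4] |= g << (2 * (i % 4))
    return bytes(out)


def unpack_2bit(packed: bytes, count: int) -> List[int]:
    """Inverse of pack_2bit for the first `count` genotypes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]


# Toy input: 5 samples (rows) x 4 variants (columns).
sample_major = [
    [0, 1, 2, 0],
    [1, 1, 0, 0],
    [2, 0, 0, 0],
    [0, 1, 3, 0],
    [0, 2, 1, 0],
]

variant_major = transpose(sample_major)                      # 4 rows of 5 genotypes
kept = [row for row in variant_major if is_informative(row)]  # drops the all-zero variant
packed = [pack_2bit(row) for row in kept]                     # 2 bytes per kept variant
```

In this toy case each kept variant shrinks from five integers to two bytes, and one of the four variants is removed before any data would be written. The reduction factors reported in the abstract come from the thesis's own filters and encoding on real data, and nothing here reproduces them.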
Keywords/Search Tags:Matrix transpose, Data filtering, Storage wall, Parallel data framework, Big Data Analysis, Parallel tuning