
Research on Parallel Processing Techniques for Proteomic Spectra Big Data

Posted on: 2020-09-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Li
GTID: 1360330623951695
Subject: Computer Science and Technology
Abstract/Summary:
High-throughput shotgun sequencing based on tandem mass spectrometry (MS/MS) has become a powerful method for identifying the unknown amino acid sequences of proteins in proteomics research. MS/MS data contain valid information about proteins and peptides, and accurate analysis of these data is both a key step in proteomics research and a prerequisite for subsequent analysis of protein structure and function. However, existing sequencing methods are computationally inefficient for several reasons. First, modern mass spectrometers generate very large numbers of tandem mass spectra, and analyzing and interpreting these spectra has become a bottleneck in proteomics research. Second, the criteria for peptide identification have become more demanding, e.g., they now take into account chemical and post-translational modifications and/or semi-unconstrained enzyme searches. Finally, depositing and updating acquired sequence data in central protein databases such as Swiss-Prot usually requires inspection and analysis of large amounts of raw data files. The biggest challenge in current computational proteomics is therefore to analyze massive mass spectrometry data quickly and accurately. This dissertation investigates parallel techniques for large-scale protein identification. The main contributions are summarized as follows:

(1) The first contribution is a parallel algorithm for de novo peptide sequencing built on the Hadoop distributed computing framework. State-of-the-art de novo peptide sequencing tools are limited to small-scale datasets, which precludes thorough and fast analysis of large mass spectrometry datasets. Hadoop is an open-source distributed computing platform that is widely used in academia and industry; it includes an implementation of MapReduce and a distributed file system, HDFS. In this contribution, we propose a novel algorithm for deriving peptide sequences from mass spectrometry data in parallel. Thanks to Hadoop, the proposed algorithm maximizes its overall processing capacity and achieves high fault tolerance. It employs an efficient data rebalancing scheme to improve sequencing speed, and it detects and recovers from runtime faults automatically and quickly. These two features guarantee the correctness and accuracy of the results. Based on this algorithm, we developed a de novo sequencing tool named MRUniNovo. Our experimental results demonstrate that MRUniNovo significantly reduces the execution time needed for de novo peptide sequencing without sacrificing the correctness and accuracy of the results.

(2) The second contribution is a de novo peptide sequencing tool for large-scale MS/MS spectra analysis on the SW26010 many-core processor. The rapidly growing size of MS/MS spectra datasets inevitably drives up the computational demand of existing de novo peptide sequencing methods, an issue that urgently needs to be solved in computational biology. An effective solution is to use high-performance heterogeneous architectures to accelerate MS/MS data processing. This contribution introduces SWPepNovo, an efficient tool based on the SW26010 many-core processor. SWPepNovo processes large-scale peptide MS/MS spectra using a parallel peptide-spectrum match (PSM) algorithm. The tool combines a two-level parallelization mechanism with three optimization strategies to overcome both the compute-bound and the memory-bound bottlenecks in the parallel PSM algorithm. Experiments on multiple spectra datasets evaluate the performance of SWPepNovo against three state-of-the-art peptide sequencing tools: PepNovo+, PEAKS and DeepNovo-DIA. SWPepNovo shows high scalability on extremely large datasets of up to 11.22 GB. In large-scale dataset experiments, SWPepNovo processes 282 spectra per second on nodes with an SW26010 processor, nearly 25 times faster than the existing algorithm, PepNovo+.

(3) The third contribution is a database search algorithm for large-scale peptide identification on the Many Integrated Core (MIC) architecture. MS/MS-based database search sequencing is a powerful and widely used method for high-throughput protein analysis. Because of the rapid growth of spectra data produced by advanced mass spectrometers, and because far more modified and digested peptides have been identified in recent years, current peptide database search methods cannot process large MS/MS spectra datasets rapidly and thoroughly. This contribution presents MCtandem, an efficient tool for large-scale peptide identification on Intel MIC. To support big data processing, we propose a novel parallel match scoring algorithm called MIC-Spectrum Dot Product (MIC-SDP). In addition, a series of optimization strategies on both the host CPU side and the MIC side, including prefetching, an optimized communication overlapping scheme, multithreading and Hyper-Threading, are used to improve execution performance. We ran MCtandem on a very large dataset on a MIC cluster and achieved much higher speed and scalability than the benchmark GPU-based programs.

(4) The last contribution is a highly efficient tool for large-scale database searching with a parallel spectrum dot product on the SW26010. With the rapid development of mass spectrometry technology, large-scale MS/MS analysis is becoming increasingly common in proteomics research. However, existing protein database search methods do not support large-scale datasets, so large-scale dataset analysis cannot be completed in an acceptable time. To address this critical issue, this contribution presents SW-Tandem, a new tool for large-scale peptide sequencing. SW-Tandem parallelizes the spectrum dot product scoring algorithm and exploits the advantages of the SW26010 by adopting an efficient structured mass spectrometry data conversion method and a highly scalable inter-MPE communication scheme. Experiments on multiple datasets demonstrate the performance of SW-Tandem against state-of-the-art tools for peptide identification. In addition, SW-Tandem shows high scalability in experiments on extremely large datasets of up to 12 GB.
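The spectrum dot product scoring mentioned in contributions (3) and (4) can be illustrated with a minimal serial sketch. This is not the MIC-SDP or SW-Tandem implementation, which the abstract does not describe in detail. The function names, the `(m/z, intensity)` peak representation and the fixed bin width of 1.0 are assumptions made only for illustration.

```python
from collections import defaultdict


def bin_peaks(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (hypothetical helper)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def spectrum_dot_product(observed, theoretical, bin_width=1.0):
    """Dot product of two binned spectra; only shared bins contribute."""
    obs = bin_peaks(observed, bin_width)
    theo = bin_peaks(theoretical, bin_width)
    return sum(value * theo[b] for b, value in obs.items() if b in theo)


def best_match(observed, candidates, bin_width=1.0):
    """Score every candidate peptide spectrum and return the best name and all scores.

    `candidates` maps a peptide name to its theoretical peak list (assumed layout).
    Each score depends only on one observed/candidate pair, so the loop body
    is independent per candidate.
    """
    scores = {
        name: spectrum_dot_product(observed, peaks, bin_width)
        for name, peaks in candidates.items()
    }
    return max(scores, key=scores.get), scores
```

Because each observed/candidate pair is scored independently, this kind of loop is a natural target for the many-core parallelization the abstract describes. How the work is actually partitioned across cores in MCtandem or SW-Tandem is not stated here.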
Keywords/Search Tags: proteomics, de novo sequencing, database search sequencing, high performance computing, big data processing