In recent years,the proliferation of biological data has brought new challenges to the various fields of bioinformatics calculations.As one of the important topics of bioinformatics research,the problem of gene sequence assembly and alignment is inevitably faced with the problem that the data processing speed is slow and the serial algorithm is outdated.How to achieve efficient gene assembly and gene alignment has become an urgent problem to be solved.The development of large data processing technology,especially the development of Hadoop and Spark platform,which supports this technology,makes us have more and more computational resources available,which provides a good opportunity for the solution of the problem.In this thesis,based on the poor calculation accuracy and long computation time of existing algorithms and software in the current gene sequence Denovo assembling,we propose an assembling algorithm SA-BR(Sequence Assembly Based on Read)based on strategy of memorizing k-mers’ source in the process of assembling.Then design parallel algorithm SA-BR-MR(SA-BR based on MapReduce)based on MapReduce by fully studying the distributed parallel computing framework MapReduce.By programming,the assembling results can come up to the accuracy of 100%,and the calculation time has been a greater degree of improvement compared with the literature.For the multiple sequence alignment problem,since there is no parallel software with both accuracy and efficiency for large-scale DNA sequences,this thesis proposes an improved center star alignment algorithm(MSA-K-Mers)based on the strategy of computing the co-k-mers to calculate the central sequence.By programming the Map and Reduce function,propose improved center star alignment algorithm based on MapReduce(MSA-K-Mers-MR).The calculation of 96 mitochondrial gene sequences for each length 15 k shows that the algorithm not only can improve accuracy,but also can reduce a lot more times time than the software’s computation time.Because the advantages of memory computing for Spark,we analyze the characteristics of the two algorithms and Spark’s parallel computing mechanism and by using of parallel algorithm design concept,re-parallel design and construct the two algorithms SA-BR-Spark and MSA-K-Mers-Spark using suitable RDD,the results show that the calculation of time efficiency greatly improved.The thesis’ s contents and research results are as follows:1.In this thesis,we propose an algorithm SA-BR(Sequence Assembly Based on Read)based on concept of k-mers and memorizing k-mers source.After dividing reads into k-mers,in the process of re-numbering all k-mers,we record k-mers’ information at the same time,also record the read number from which the k-mers’ come and this k-mer’s index in its read.In the process of assembling,due to the record of these information,the same k-mers from the same read will not assemble wrongly.At the same time,by giving full play to MapReduce framework to analyze the characteristics of the data and under the guidance of parallel algorithm design concept,propose the parallel algorithm SA-BR-MR(SA-BR based on MapReduce)based on MapReduce.The randomly selected 54 sequences of animals,plants and microbial species from tens of thousands were calculated,the matching rates all can reach 100%.2.In the calculation of 54 sequences,the upper and lower bounds of k are recorded.Hepatitis C virus was used as an example,and eight kinds of hepatitis C virus variants were randomly selected.The calculated results verified the correctness of k value.The range of K value provides a theoretical basis for the sequencing of such viruses.3.Based on the analysis of the algorithm on the base of k-mers concept and memory k-mers source,this thesis fully studies the rules of RDD correlation function in Spark environment,re-design the algorithm,program it and propose improved assembly algorithm based on Spark(SA-BR-Spark).Under the same conditions,compared with the MapReduce framework,it can save about 50% of the calculation time,and give full play to the time advantage of memory computing.4.For the time and space complexity of multiple sequence alignment problem,and no existence for accurate and efficient alignment algorithm and software tool of large-scale genome of DNA,on the basis of fully studying the star alignment algorithm,we propose an improved center star alignment algorithm(MSA-K-Mers)based on the strategy of computing the central sequence by refining the shared k-mers,rather than randomly select any sequence as a central sequence.By make full use of data analysis advantages of MapReduce,cleverly conform the enter data into Map and Reduce function required <key,value> data,design the improved center star alignment algorithm based on MapReduce(MSA-K-Mers-MR).The calculation of 96 mitochondrial gene sequences with length of 15 k indicates that the algorithm is less than the literature in computing time as the accuracy of the algorithm is improved.However,compared with the classical software Clustlw,the calculation time is doubled shorten.5.This thesis analyzes the characteristics of the iterative calculation for Spark platform,analyzes the idea of obtaining center sequence two times in improved center star sequence alignment algorithm and comparing the center sequence with each input sequence twice,analyzes which stages of the algorithm can be parallel and what the process can only be serialized,analyzes the overall process of the designed algorithm,analyzes the RDD number,data format,iterative calculation of the entrance and exit,propose improved center star alignment algorithm based on Spark(MSA-K-Mers-Spark).By coding,running in the local multi-threaded mode,compared with the MapReduce platform,the calculation time greatly improved.The flexible use of flexible distributed data sets RDD greatly facilitates parallel programming.6.By further parallelizing the process of pairwise alignment between center sequence and each input sequence in MSA-K-Mers-Spark,proposes MSA-K-Mers-Spark and named Improved-MSA-K-Mers-Spark.Experimental results show that Improved-MSA-K-Mers-Spark compared with MSA-K-Mers-Spark,the calculation time is only one-sixth of the latter. |