| With the development of high-throughput sequencing technology, sequence data is growing exponentially, and analyzing and mining these data for valuable information is an active research topic. In bioinformatics, pairwise sequence alignment is used to identify highly similar sequences, which are then compared further to predict homology among multiple sequences. However, exhaustively comparing a massive number of sequences is complex and time-consuming. To improve the efficiency and scalability of pairwise comparison, this thesis studies the parallelization of pairwise sequence alignment using big data technology. The main work is as follows: (1) The BLAST pairwise alignment algorithm is implemented on a single machine. The execution steps of the original software are simplified, and the results are consistent with those of the original software. (2) Using the principle of equal division, pairwise alignment tasks are parallelized on a Linux cluster, which improves comparison efficiency over single-machine operation. (3) Based on the configuration files of the comparison tasks, the BLAST algorithm is invoked through the pipe mechanism of the Spark framework, so that pairwise alignment tasks are processed on Spark. For this thesis, a 16-node Spark cluster environment is built on the vSphere virtualization platform, and a large number of comparative experiments are carried out on a single machine, a Linux cluster, and the Spark cluster. The experimental data show that the total run time of pairwise alignment on the Spark cluster is lower than in the single-machine and Linux cluster environments. Moreover, as the number of computing nodes in the cluster increases, the Spark-based approach becomes more efficient, which demonstrates its scalability. |
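
The equal-division principle mentioned in item (2) can be illustrated with a minimal sketch. It assumes the alignment tasks are held in a simple list and that "equal division" means splitting them into contiguous chunks whose sizes differ by at most one. The function name `split_equally` and its parameters are hypothetical and are not taken from the thesis, whose actual partitioning scheme may differ.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_equally(tasks: Sequence[T], n_workers: int) -> List[List[T]]:
    """Split tasks into n_workers contiguous chunks of near-equal size.

    Hypothetical helper for illustration only: chunk sizes differ by at
    most one, and the first (len(tasks) % n_workers) chunks get the extra task.
    """
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")

    # base: minimum chunk size; extra: number of chunks that get one more task
    base, extra = divmod(len(tasks), n_workers)

    chunks: List[List[T]] = []
    start = 0
    for i in range(n_workers):
        size = base + (1 if i < extra else 0)
        chunks.append(list(tasks[start:start + size]))
        start += size
    return chunks
```

Under this assumption, each chunk would be handed to one cluster node. In the Spark setting of item (3), a comparable list of task descriptions could be distributed as an RDD and streamed to an external alignment program through the RDD `pipe` operation, although the exact command line used in the thesis is not given here.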