| With the growing popularity of the mobile Internet, data such as voice, images, and video are growing explosively, which makes the processing of massive data an important challenge. In this context, Hadoop and its ecosystem have become the de facto standard for big data processing, providing diverse applications and tools for handling massive data. At the same time, the processing of massive data drives the rapid development of parallel computing for machine learning and data mining algorithms. Matrix algorithms are the basis of many of these algorithms, so their parallel implementation is particularly important. In this regard, Spark, a newer computing framework that makes up for the shortcomings of the MapReduce framework, is the best choice for implementing parallel matrix algorithms because it integrates seamlessly with the Hadoop ecosystem. In this thesis, we build a small parallel matrix library based on the features of Spark. The library includes different representations of distributed matrices, such as row-partitioned and block-partitioned matrices, and implements parallel matrix multiplication and computation of the Smith normal form. Of these, parallel matrix multiplication is the core of the other parallel matrix algorithms. We first implement a Spark-based parallel matrix multiplication algorithm that can handle large-scale dense matrices and analyze its bottlenecks through experiments. We then optimize the multiplication by the elementary transformation matrices involved in the Smith normal form computation, which greatly reduces the amount of data transmitted over the network and improves the performance of the algorithm. At present, algorithms for computing the Smith normal form run only on a single machine and cannot handle large-scale distributed matrices. Therefore, this thesis proposes a greatest common divisor algorithm that supports parallel Smith normal form computation on block-partitioned matrices and implements it in the Spark computing framework; the new algorithm can handle larger-scale matrix operations. Finally, we conduct experiments on the algorithm to verify its correctness and scalability. |
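The block-based parallel matrix multiplication mentioned in the abstract can be illustrated with a minimal single-process sketch. It is written in plain Python rather than Spark, and the function names (`split_blocks`, `block_multiply`, `assemble`) and the map/reduce-style grouping are assumptions made for illustration, not the thesis's actual implementation. The idea shown is that each pair of blocks A(i,k) and B(k,j) yields a partial product keyed by (i,j), and the partial products sharing a key are then summed.

```python
from collections import defaultdict


def split_blocks(matrix, block_size):
    """Split a dense matrix (list of rows) into {(block_row, block_col): block}."""
    rows, cols = len(matrix), len(matrix[0])
    blocks = {}
    for i in range(0, rows, block_size):
        for j in range(0, cols, block_size):
            blocks[(i // block_size, j // block_size)] = [
                row[j:j + block_size] for row in matrix[i:i + block_size]
            ]
    return blocks


def multiply_dense(a, b):
    """Ordinary dense product of two small in-memory matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add_dense(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def block_multiply(a_blocks, b_blocks):
    """Multiply two block-partitioned matrices.

    Hypothetical stand-in for a distributed job: the first loop plays the
    role of a join/map step, the second the role of a reduce-by-key step.
    """
    partials = defaultdict(list)
    for (i, k), a_blk in a_blocks.items():
        for (k2, j), b_blk in b_blocks.items():
            if k == k2:  # only blocks with matching inner index are multiplied
                partials[(i, j)].append(multiply_dense(a_blk, b_blk))

    result = {}
    for key, products in partials.items():
        total = products[0]
        for p in products[1:]:
            total = add_dense(total, p)
        result[key] = total
    return result


def assemble(blocks):
    """Reassemble a block dictionary into a single dense matrix."""
    n_block_rows = max(i for i, _ in blocks) + 1
    n_block_cols = max(j for _, j in blocks) + 1
    matrix = []
    for bi in range(n_block_rows):
        for r in range(len(blocks[(bi, 0)])):
            row = []
            for bj in range(n_block_cols):
                row.extend(blocks[(bi, bj)][r])
            matrix.append(row)
    return matrix


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
    product = assemble(block_multiply(split_blocks(A, 2), split_blocks(B, 2)))
    print(product)
```

In a distributed setting, the pairing step is where blocks would be shuffled across the network, which is consistent with the abstract's remark that reducing transmitted data improves performance.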