| Since the start of the twenty-first century, advances in network transmission technology and growth in link bandwidth have driven a rapid increase in Internet users and applications, and the most obvious change is that the volume of data has grown explosively. Massive network traffic data poses storage and computing problems. With its high reliability, high efficiency, high scalability, high fault tolerance, and low cost, the Hadoop platform became a platform for analyzing massive network traffic data. However, as data volumes continue to grow rapidly, Hadoop has become increasingly inadequate, and it was at this moment that Spark came into being. Compared with MapReduce, Spark is more concise and more efficient. In the face of ever-increasing network traffic data, performance analysis of massive network data analysis is particularly important. In this thesis, the Hadoop data analysis platform is introduced, and the MapReduce computation model and the distributed file system HDFS are briefly described. The thesis focuses more on the Spark computation framework, including its overall architecture, core concepts, job execution process, and Shuffle. Then, based on massive data analysis applications, performance optimization methods are proposed, including choosing appropriate operators, improving data locality, using persistence, and selecting an appropriate degree of parallelism, and their performance is compared through experimental evaluation. Next, taking the PageRank algorithm as an example of join, a common operation in Spark, the join is optimized and its performance is evaluated, which is instructive for applications that require the join operation, especially for recursive scenarios involving multiple joins. |