| With the development of big data and cloud computing, an increasing number of users rely on big data processing platforms to analyze and mine the information hidden in large-scale datasets. For example, e-commerce companies use data analysis platforms based on Hadoop and Spark to support accurate product recommendation and business process modeling. To meet the operation and management needs of large-scale data centers, Internet companies commonly process log data with Flink to improve responsiveness to workload spikes. However, because operation and management are complex, setting up a big data processing platform is very costly. Hence, some cloud service providers offer their big data processing platforms to public users as SaaS services, such as AliCloud E-MapReduce, Amazon EMR, and Microsoft Azure HDInsight. End users can quickly obtain the platform they need by purchasing these cloud-based big data processing services, and the platform can be easily reconfigured and resized as needed. The deployment type of a cloud-based big data processing platform has a significant impact on its performance. The underlying file storage system, such as a local file system or a remote file storage service, can cause large differences in platform performance. In addition, as demand for big data processing keeps growing, the expansion mode, i.e., scale-out or scale-up, can also greatly affect system performance. Therefore, how to purchase, build, and configure a low-cost, high-performance, and highly scalable big data processing platform is an essential issue for end users as well as cloud service vendors. This paper makes the following contributions: (1) To address the problem of measuring scalability on cloud-based big data processing platforms, we used AliCloud E-MapReduce as a testbed and conducted a series of extensive experiments. We proposed a novel scalability metric, which was
validated on Hadoop clusters. (2) To evaluate the impact of different file storage systems on big data processing platforms, we analyzed the performance differences of Hadoop and Spark on HDFS (Hadoop Distributed File System) and OSS (Object Storage Service). To evaluate the impact of different scaling modes on platform performance, we analyzed the performance differences across workloads, so as to guide users in optimizing the system according to the load. (3) To address the performance factors identified in the preceding analysis, we proposed several performance optimization approaches: compressing the Map-side output of MapReduce tasks to reduce the amount of data transferred over the network, for which we designed an algorithm, InterMapCompressor, that compresses the Map results so as to improve Shuffle performance and shorten task execution time; increasing the file system block size to improve Hadoop performance for CPU-intensive workloads; and parameter tuning. Finally, extensive optimization experiments were carried out on the AliCloud E-MapReduce platform, and the results verified the effectiveness of the proposed optimization approaches. The work presented in this paper, including the scalability metric, the performance optimization approaches, and the experimental findings, can help not only end users with platform selection, deployment, and workload placement, but also cloud service vendors with optimizing cluster deployment and capacity planning. |