| With the growth of the Internet, user data are accumulating ever more rapidly. These data hold considerable opportunity: rather than being treated as useless, they are now put to effective use and create economic value. A single machine cannot analyze data at this scale, so processing is commonly distributed across multiple computers. Data are collected and passed to the cluster continuously. An excessive storage speed overloads the cluster and leads to data accumulation, whereas too low a speed fails to exploit the cluster's resources fully. It is therefore worth studying which collection speed makes full use of the existing resources given the real-time status of the cluster. This article uses Hadoop, Spark, and HBase together to provide solutions for data storage and processing. The best collection speed is determined by comparing processing speeds. Data storage, data processing, monitoring, and performance evaluation are studied in the context of large-scale data processing. The main work of this article is as follows. First, the concepts, characteristics, and architectures of Hadoop, Spark, and HBase are described, and a cluster is built. Second, Spark-HDFS, MapReduce-HDFS, and MapReduce-HBase are used as data filtering schemes for processing the data stream, and PageRank is used to find active users. Third, the storage speed and the real-time data processing speed are analyzed, so that users can choose the most reasonable speed from the results and maximize resource utilization. Finally, a cluster monitoring system based on JMX is developed, which reports the real-time status of the cluster as well as its CPU and memory usage.
The system performance evaluation standard and conclusions are presented. The results help maximize the efficient use of cluster resources and offer guidance for optimizing large-scale data storage and processing from the perspective of data stream speed and cluster performance. A method for monitoring the status of a cluster is also provided. This article shows that, under current technical conditions, the designed method is effective and feasible in both theory and practice. |
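
To make the PageRank step concrete, the following is a minimal single-machine sketch of how active users could be ranked from an interaction graph. It is an illustration under assumptions, not the implementation used in this article: the abstract does not state how the graph is built, so the edge meaning (one user interacting with another), the user names, the `pagerank` function, and the damping factor of 0.85 are all hypothetical choices. In the article itself this step runs on the cluster rather than in plain Python.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed edge list [(src, dst), ...]."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by users with no outgoing edges is spread evenly over everyone.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each user passes an equal share of its damped rank along every outgoing edge.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical interaction log: (user who interacts, user being interacted with).
interactions = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("dave", "bob"),
    ("bob", "alice"),
]
ranks = pagerank(interactions)
most_active = max(ranks, key=ranks.get)
print(most_active, ranks)
```

Under these assumptions, the user who receives the most interactions from other well-ranked users ends up with the highest score, which is one plausible reading of "active user" in this setting.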