Performance Monitoring And Optimization For Large-scale Data Processing In Cluster

Posted on:2016-07-14

Degree:Master

Type:Thesis

Country:China

Candidate:X M Wang

Full Text:PDF

GTID:2308330461477180

Subject:Software engineering

Abstract/Summary:

With the development of the Internet, user data are growing ever more rapidly. These data hold huge opportunities: they are no longer treated as useless but are put to effective use, creating economic value. A single machine cannot analyze data at this scale, so clusters of multiple computers are widely used for data processing. Data are collected and passed to the cluster continuously. If the storage speed is too high, the cluster becomes overloaded and data accumulate; if it is too low, the cluster's resources are not fully exploited. It is therefore worth studying which data collection speed makes full use of existing resources given the real-time status of the cluster.

This thesis combines Hadoop, Spark and HBase to provide solutions for data storage and processing. The best collection speed is determined by comparison against the processing speed. Data storage, data processing, monitoring and performance evaluation are studied in the context of large-scale data processing. The main work is as follows. Firstly, the concepts, characteristics and architectures of Hadoop, Spark and HBase are described, and a cluster is built. Secondly, Spark-HDFS, MapReduce-HDFS and MapReduce-HBase are used as data filtering schemes for processing the data stream, and PageRank is used to find active users. Thirdly, the storage speed and the real-time data processing speed are analyzed; from the results, users can choose the most reasonable speed so as to maximize resource utilization. Finally, a cluster monitoring system based on JMX is developed. The system reports the real-time status of the cluster as well as CPU and memory usage.

The performance evaluation criteria for the system and the conclusions are presented. The results help maximize the efficient use of cluster resources and optimize large-scale data storage and processing from the perspective of data stream speed and cluster performance. A method for monitoring the status of a cluster is also provided. The thesis shows that, under current technical conditions, the designed method is effective and feasible in both theory and practice.
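
The abstract names PageRank as the way active users are found but gives no detail. As a rough illustration only, the sketch below runs a plain power-iteration PageRank over a small edge list. The user names, the meaning of an edge (one user replying to or mentioning another), the damping factor of 0.85 and the iteration count are all assumptions made for this example; none of them come from the thesis, which runs on a Hadoop/Spark cluster rather than in a single Python process.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed edge list of (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Every other node splits its damped rank equally among its targets.
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical interaction data: (a, b) means user a replied to or mentioned user b.
interactions = [
    ("alice", "carol"),
    ("bob", "carol"),
    ("dave", "carol"),
    ("carol", "alice"),
    ("dave", "bob"),
]

ranks = pagerank(interactions)
# Users ordered from highest to lowest score.
most_active = sorted(ranks, key=ranks.get, reverse=True)
print(most_active)
```

The scores always sum to 1, so they can be read as each user's relative share of attention in the interaction graph. On a real cluster the same iteration would be expressed as distributed joins and aggregations rather than Python dictionaries.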

Keywords/Search Tags:

Large-scale Data Processing, Hadoop, Spark, Performance Monitoring

Related items

1	The Analysis And Monitoring Of Data Models In Different E-commerce Rule Engines
2	English On Design And Implementation Of Network Data Parallel Processing System Based On Hadoop Platform
3	Study On Data Fusion Of The Large Scale Carbon Cycle Model Based On Spark
4	Research On Parallel Clustering Algorithm For Large - Scale Data Set
5	Research On Large-scale Traffic Classification Technology Based On Spark Performance Optimization
6	The Research On Distributed Task Scheduling Algorithms Based On Hadoop Platform
7	Performance comparison by running benchmarks on Hadoop, Spark, and HAMR
8	Structured Data Processing And Performance Optimization Of Spark SQL
9	Design And Implementation Of Monitoring System For Large Scale Data Center
10	Research On Consumer Behavior Based On Large Scale Of E-commerce Data