
Design And Implementation Of Log Analysis Tools Based On Spark

Posted on: 2018-12-02
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Ma
Full Text: PDF
GTID: 2348330518498967
Subject: Engineering
Abstract/Summary:
In recent years, with the rapid development of computer science and technology, major Internet companies have introduced a wide variety of products to meet people's everyday needs. These products generate massive amounts of data every day, and the big data industry has emerged to solve the resulting storage and processing problems. From a technical point of view, a single computer can no longer complete storage and computation at this scale; instead, large numbers of servers distributed across different regions are organized into clusters that provide the distributed computing power needed for big data processing. As a distributed computing framework, Spark has become increasingly popular and is widely used in industry. However, the Spark system is complex: ordinary users do not understand its underlying operating principles, performance tuning requires a great deal of specialized knowledge, and production environments lack a tuning tool based on Spark log analysis. From the Spark logs, developers want to learn the low-level performance of the cluster while a Spark job is running, find problems that may exist in the system, tune performance, improve computational efficiency, and reduce job running time. Developing a Spark log analysis system that helps Spark users, and even Spark developers, tune performance is therefore significant.

Aiming at the lack of Spark log analysis in production environments, this thesis studies the design and implementation of a log analysis tool based on Spark. After elaborating Spark's ecosystem and the basic concepts of the EGO cluster, it reviews the current state of Spark log analysis and Spark performance tuning at home and abroad. From the perspective of the Spark source code, it introduces the life cycle of a Spark application, describing how a Spark job is divided and executed step by step after submission. It then introduces IBM's CwS (Conductor with Spark) technology, details the publish-subscribe design model that Spark uses to distribute and record events, and presents the format and content of the Spark event log (both are sketched after the abstract).

The thesis analyzes the business, functional, and performance requirements of the log analysis system, examines the factors that affect the execution of Spark jobs from the aspects of scheduling performance and task performance, and selects the metrics that influence the performance of the Spark system. It also studies the data locality of Spark jobs and designs a data locality evaluation method (one plausible form is sketched after the abstract); these two results form the basis of the log analysis system. The application architecture of the system is given and further supported by the detailed design of each module. The system implements data collection, preprocessing, and persistence; analyzes the data at application, job, and stage granularity; and displays the analysis results at each granularity through different types of statistical charts. Finally, an experimental environment is set up and designed experiments verify the functions of the log analysis system.

The main result of this work is the Spark log analysis tool. The system currently runs well and has achieved satisfactory results: it successfully collects log data, performs performance analysis of Spark jobs, and presents the results intuitively to Spark users and developers. It effectively helps IBM's Spark developers understand low-level cluster behavior and provides guidance for performance tuning work.
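The publish-subscribe model mentioned above can be illustrated with a minimal, self-contained Python sketch. It shows only the pattern by which events are distributed to subscribers such as an event-log writer; Spark's actual listener bus additionally queues events and dispatches them asynchronously, and the class and method names here are illustrative, not Spark's API.

    class ListenerBus:
        """Minimal publish-subscribe bus, sketching how events can be
        distributed to subscribers (e.g. an event-log writer)."""

        def __init__(self):
            self._listeners = []

        def subscribe(self, listener):
            # Register a callback that will receive every posted event.
            self._listeners.append(listener)

        def post(self, event):
            # Deliver the event to all subscribers; a production bus
            # would queue events and dispatch them on a separate thread.
            for listener in self._listeners:
                listener(event)

    bus = ListenerBus()
    bus.subscribe(lambda e: print("logged:", e["Event"]))
    bus.post({"Event": "SparkListenerApplicationStart"})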
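The event log itself is a plain-text file in which each line is one JSON-encoded listener event. A minimal parsing sketch of the collection step follows; the field names ("Event", "Task Metrics", "Executor Run Time") follow Spark's standard event-log schema, while the function name and example path are placeholders.

    import json
    from collections import Counter

    def summarize_event_log(path):
        """Count event types and total executor run time (ms) in a
        Spark event log, which stores one JSON event per line."""
        event_counts = Counter()
        total_run_time_ms = 0
        with open(path) as f:
            for line in f:
                event = json.loads(line)
                event_counts[event["Event"]] += 1
                if event["Event"] == "SparkListenerTaskEnd":
                    metrics = event.get("Task Metrics") or {}
                    total_run_time_ms += metrics.get("Executor Run Time", 0)
        return event_counts, total_run_time_ms

    # Point this at a file written under spark.eventLog.dir, e.g.:
    # counts, run_time_ms = summarize_event_log("app-20180101000000-0000")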
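For the data locality evaluation method, the abstract does not give the formula, so the sketch below shows only one plausible approach: weight each task's locality level (Spark's PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY, as recorded in the event log's Task Info) and average the weights into a single score. Both the weights and the helper name are assumptions for illustration, not the thesis's actual method.

    # Hypothetical weights per Spark locality level; these values are
    # illustrative assumptions, not the evaluation method from the thesis.
    LOCALITY_WEIGHTS = {
        "PROCESS_LOCAL": 1.0,  # data cached in the same executor process
        "NODE_LOCAL": 0.75,    # data on the same node
        "RACK_LOCAL": 0.5,     # data on the same rack
        "ANY": 0.0,            # data fetched from an arbitrary node
    }

    def locality_score(task_localities):
        """Average per-task locality weights into a score in [0, 1]."""
        if not task_localities:
            return 0.0
        return sum(LOCALITY_WEIGHTS.get(l, 0.0)
                   for l in task_localities) / len(task_localities)

    print(locality_score(["PROCESS_LOCAL", "NODE_LOCAL", "ANY"]))  # ~0.58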
Keywords/Search Tags: Log analysis, Big data, Distributed computing, Performance tuning