Study On Web Log Storage And Preprocessing Optimization Based On Hadoop

Posted on:2017-02-17

Degree:Master

Type:Thesis

Country:China

Candidate:Y Y Song

Full Text:PDF

GTID:2308330485490008

Subject:Computer Science and Technology

Abstract/Summary:

Web logs are growing rapidly with the development of the Internet, the mobile Internet, and related technologies. Because a Web log records how Internet users browse Web pages, it offers important guidance for building websites and for providing more accurate services to clients. However, a raw Web log file contains much incomplete, redundant, and erroneous data, so using it directly is difficult and can produce wrong results. Preprocessing Web log data is therefore necessary. In addition, traditional relational databases face storage constraints, and single-node data processing has limited capacity. This paper therefore adopts the Hadoop distributed processing platform to store and preprocess Web logs. The main contents are as follows:

(1) Web log data storage

With the rapid growth of massive Web logs, traditional storage technology suffers from high construction cost, complex operation and maintenance, and limited scalability. Popular cloud databases, by contrast, offer dynamic expansion, high scalability, high throughput, and low cost. This paper therefore stores Web logs in the Hadoop HBase database to take full advantage of the distributed processing cluster.

(2) HBase load balancing optimization

How data is stored in HBase largely determines the performance of the entire cluster and directly affects the efficiency of subsequent read operations. When MapReduce reads the Web log data stored in HBase, a "hot spot" problem arises. To address this problem, this paper proposes an improved HBase load balancing algorithm based on child table limitation: when allocating child tables, it considers not only the load of each HRegionServer but also how the regions split from a child table are distributed, thereby balancing the cluster load as far as possible.

(3) Web log preprocessing using MapReduce

Web log preprocessing influences the quality of Web mining, and as the volume of Web logs grows, the computing power of a single node becomes insufficient. MapReduce, however, supports large-scale cluster operation. After analyzing the Web log preprocessing workflow, this paper reads data from HBase and uses the MapReduce computation model to carry out the preprocessing.

Experimental results show that the optimized HBase load balancing algorithm effectively solves the problem of unbalanced load access and that, in an appropriate cluster environment, MapReduce preprocesses Web logs efficiently. Finally, this paper optimizes the preprocessing algorithms and verifies the efficiency of the optimized algorithms.
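
The idea in item (2), placing regions by looking at both the load of each HRegionServer and the spread of each table's regions, can be illustrated with a small sketch. This is not the algorithm from the thesis, whose details the abstract does not give. It is a simple greedy placement written under assumptions: the function name assign_regions, the (region, table, load) tuples, and the numeric load values are all hypothetical, and no HBase API is used.

```python
from collections import defaultdict


def assign_regions(regions, servers):
    """Greedy, table-aware placement (illustrative only, not the thesis algorithm).

    regions: list of (region_name, table_name, load) tuples (hypothetical format)
    servers: list of server names
    Returns a dict mapping each server to the list of region names placed on it.
    """
    total_load = {s: 0.0 for s in servers}                # summed load per server
    table_count = {s: defaultdict(int) for s in servers}  # regions per table per server
    placement = {s: [] for s in servers}

    # Place the heaviest regions first so large loads are spread out early.
    for name, table, load in sorted(regions, key=lambda r: -r[2]):
        # Prefer the server that holds the fewest regions of this table;
        # break ties by choosing the server with the lowest total load.
        target = min(servers, key=lambda s: (table_count[s][table], total_load[s]))
        placement[target].append(name)
        table_count[target][table] += 1
        total_load[target] += load
    return placement


if __name__ == "__main__":
    demo_regions = [
        ("weblog,r1", "weblog", 10), ("weblog,r2", "weblog", 10),
        ("weblog,r3", "weblog", 10), ("weblog,r4", "weblog", 10),
        ("users,r1", "users", 1), ("users,r2", "users", 1),
    ]
    for server, placed in assign_regions(demo_regions, ["rs1", "rs2"]).items():
        print(server, placed)
```

Because the per-table count is compared before the total load, the regions of a heavily read table are spread across servers instead of piling up on one of them, which is the kind of hot spot a MapReduce scan over a single table would otherwise hit.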

Keywords/Search Tags:

Web log preprocessing, Hadoop, HBase load balancing, MapReduce

Related items

1	Research On Load Balancing Algorithm For Scheduling Based On Hadoop
2	Research On Optimization Of Data Load Balancing In Hadoop Clusters And Application Of Hadoop Platform
3	Research On MapReduce Performance Optimization Based On Hadoop
4	Research Of Job Scheduling On Hadoop Platform Based On Load Balancing
5	Research And Implementation Of Load Balancing Strategy Based On Distributed Database HBase
6	Research On Energy-aware Load Balancing In Heterogeneous Hadoop Cluster
7	The Research Of Load Balancing In Mapreduce Based On Sampling Estimation
8	Vehicle Routing Data Processing System Based On Hadoop And C4.5 Algorithm
9	Research Of Passenger Volume Prediction Based On Hadoop Platform
10	Research On Lightweight Load Balancing Under Mapreduce