| Web logs are growing rapidly with the development of the Internet, the mobile Internet, and related technologies. Because Web logs record the browsing behavior of Internet users as they access Web pages, they offer important guidance for building websites and for providing more accurate services to clients. However, a raw Web log file contains a great deal of incomplete, redundant, and erroneous data, so using the data directly is difficult and may yield wrong results. Preprocessing the Web log data is therefore necessary. At the same time, the storage constraints of traditional relational databases and the limitations of single-node data processing must be taken into account. This paper adopts the Hadoop distributed processing platform to store and preprocess Web logs. The main contents include: (1) Web log data storage. With the rapid growth of massive Web logs, traditional storage technology faces high construction costs, complex operation and maintenance, and limited scalability. Cloud databases, by contrast, offer dynamic scaling, high scalability, high throughput, and low cost. This paper therefore stores the Web logs in the Hadoop HBase database to make full use of the advantages of a distributed processing cluster. (2) HBase load balancing optimization. The way data is stored in HBase largely determines the performance of the entire cluster and directly affects the efficiency of subsequent read operations. When MapReduce reads the Web log data stored in HBase, a "hot spot" problem arises. 
To address this problem, this paper proposes an improved HBase load balancing algorithm based on child table limitation: when allocating child tables, it considers the distribution of the child table regions produced by splitting, in addition to the load on each HRegionServer, thereby balancing the cluster load as far as possible. (3) Web log preprocessing using MapReduce. Web log preprocessing influences the quality of Web mining, and as the volume of Web logs grows, the computing power of a single node gradually becomes a bottleneck. MapReduce, however, supports large-scale cluster operation. After analyzing the Web log preprocessing process, this paper reads the data from HBase and uses the MapReduce computation model to carry out the preprocessing. Experimental results show that the optimized HBase load balancing algorithm effectively solves the problem of load access imbalance and that, in an appropriate cluster environment, MapReduce preprocesses Web logs with high efficiency. Finally, this paper optimizes the preprocessing algorithms and verifies the efficiency of the optimized algorithms. |
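
As an illustration of the preprocessing step described in (3), the following is a minimal single-process sketch of data cleaning written in a map/reduce style. It is not the implementation used in this paper. The input format (Apache Common Log Format), the filtering rules (drop malformed lines, non-GET requests, non-2xx responses, and static resources), the use of the client IP as the grouping key, and all function names (`map_clean`, `reduce_by_user`, `preprocess`) are assumptions made for the example; reading from HBase and running on a Hadoop cluster are omitted.

```python
import re
from collections import defaultdict

# Assumption: records follow the Apache Common Log Format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

# Assumption: requests for these resources are not page views.
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")


def map_clean(line):
    """Map step: parse one raw line and emit (ip, (time, url)) or nothing."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return []  # incomplete or malformed record
    if match["method"] != "GET":
        return []  # keep only page requests
    if not match["status"].startswith("2"):
        return []  # drop failed or redirected requests
    path = match["url"].split("?", 1)[0].lower()
    if path.endswith(STATIC_SUFFIXES):
        return []  # drop embedded static resources
    return [(match["ip"], (match["time"], match["url"]))]


def reduce_by_user(pairs):
    """Reduce step: group cleaned records by key and drop exact duplicates."""
    grouped = defaultdict(list)
    for key, value in pairs:
        if value not in grouped[key]:
            grouped[key].append(value)
    return dict(grouped)


def preprocess(lines):
    """Run the map step over every line, then the reduce step."""
    pairs = [pair for line in lines for pair in map_clean(line)]
    return reduce_by_user(pairs)
```

In an actual MapReduce job the map and reduce functions would run on separate nodes, with the framework handling the shuffle between them; here a list comprehension stands in for that shuffle.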