| The rapid development of the Internet has gradually changed people’s lifestyles and pace of life; work, study, and daily life now involve the Internet in every respect. The Internet resembles a huge database that holds a vast and richly diverse body of information. How to discover implicit, valuable information in massive Web data by means of data mining technology has become a topic of wide interest. Processing and analyzing Web log data has important theoretical and practical significance for improving the performance and design of a Web site and for uncovering users’ latent usage patterns and rules. Building on a study of data preprocessing methods and an analysis of association rule algorithms in Web log mining, this thesis addresses deficiencies of existing algorithms and proposes new, more efficient methods for Web log mining. In addition, we developed an effective Web log processing and analysis system based on big data platforms and technologies. The main work of the thesis is as follows. First, we study the process and methods of data preprocessing in Web log mining and propose a session identification algorithm based on Web pages and Web content. The algorithm identifies user sessions by setting the time threshold dynamically according to the homepage, navigation pages, Web content, and other factors. Experiments show that the improved algorithm identifies sessions significantly more accurately. Second, we apply the classical Apriori association rule algorithm to Web log analysis and improve the traditional algorithm in three respects: data preprocessing, data storage, and parallel computing. Experiments show that the improved algorithm performs better and executes more efficiently in both space and time. 
Finally, we combine the mainstream big data processing platform Hadoop with Flume and Kafka, integrate the Web log mining process into it, and develop an effective Web log analysis system. With this system, log data cleaning and the analysis of users’ access patterns and rules can be completed automatically. |
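
The idea of a dynamically set session threshold can be illustrated with a minimal sketch. It is not the thesis's algorithm: the page types, the timeout values in `TIMEOUTS`, and the helper `classify_page` are hypothetical choices made only for illustration. The sketch assumes that a new session begins when the gap after a request exceeds the timeout assigned to the type of the page just visited.

```python
from collections import defaultdict

# Hypothetical per-page-type timeouts in seconds (not taken from the thesis).
TIMEOUTS = {"home": 60, "navigation": 300, "content": 1800}
DEFAULT_TIMEOUT = 600


def classify_page(url):
    """Hypothetical page classifier based only on the URL shape."""
    if url in ("/", "/index.html"):
        return "home"
    if url.startswith("/category/"):
        return "navigation"
    return "content"


def identify_sessions(records, page_type=classify_page,
                      timeouts=TIMEOUTS, default=DEFAULT_TIMEOUT):
    """Split each user's requests into sessions.

    records: iterable of (user_id, timestamp_in_seconds, url) tuples.
    A new session starts when the gap since the previous request exceeds
    the timeout assigned to the previous page's type.
    Returns a list of (user_id, [url, ...]) tuples.
    """
    by_user = defaultdict(list)
    for user, ts, url in records:
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # order each user's requests by timestamp
        current = [hits[0]]
        for prev, cur in zip(hits, hits[1:]):
            limit = timeouts.get(page_type(prev[1]), default)
            if cur[0] - prev[0] > limit:
                # Gap too long for this page type: close the session.
                sessions.append((user, [u for _, u in current]))
                current = []
            current.append(cur)
        sessions.append((user, [u for _, u in current]))
    return sessions


if __name__ == "__main__":
    log = [
        ("u1", 0, "/"),
        ("u1", 30, "/category/a"),
        ("u1", 400, "/article/1"),
        ("u1", 1000, "/article/2"),
    ]
    for user, urls in identify_sessions(log):
        print(user, urls)
```

With a single fixed timeout, a short pause on a navigation page and a long read of a content page are treated alike; a per-type timeout lets the split depend on how long a user would plausibly stay on each kind of page.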