| In the digital age, new technologies, applications, and platforms emerge one after another, constantly driving the development of the Internet. At the same time, the volume of Web log data grows day by day and contains a large amount of redundant information. How to use clustering algorithms to accurately mine potentially valuable information from Web logs has become an important problem. In response, this thesis focuses on the following research: (1) An improved AP (affinity propagation) clustering algorithm, the IP-IAP clustering algorithm, is proposed based on an improved preference (reference degree) calculation and the divide-and-conquer idea. To address the issue that the median-based preference in the AP clustering algorithm does not capture the characteristics of the dataset well, the IP-AP clustering algorithm is proposed by improving the preference calculation. To address the high complexity of computing the similarity matrix in the IP-AP clustering algorithm, the IP-IAP clustering algorithm is proposed by combining the divide-and-conquer idea with the IP-AP clustering algorithm: a K-medoids algorithm with improved center selection partitions the dataset, which reduces the time complexity of the algorithm. Finally, clustering experiments on UCI datasets confirm that, compared with the AP clustering algorithm, the IP-IAP clustering algorithm achieves better clustering quality and robustness and greatly reduces the time complexity. (2) A MapReduce-based parallel IP-IAP clustering algorithm is proposed. To address the time-consuming iterative process in the IP-IAP clustering algorithm of computing the similarities and preference, updating the responsibility (attraction degree) and availability (membership degree), and determining the cluster centers, the MapReduce programming model is introduced to parallelize the iterative process of the algorithm. Moreover, the way MapReduce partitions data is closely related to the divide-and-conquer strategy of the IP-IAP clustering algorithm, which makes the combination well suited to processing massive Web log data. To study the performance of the parallel IP-IAP clustering algorithm, parallel experiments are conducted on the NASA-HTTP dataset. A comparison of speedup ratios confirms that the algorithm has good clustering effectiveness and analysis performance on massive datasets. (3) The parallelized IP-IAP clustering algorithm is applied to the Jiangxi Province Personnel and Talent Integration Platform. To help the website allocate server resources effectively, this thesis builds a log collection and storage framework using Filebeat + Kafka + Logstash + HDFS, preprocesses the log data with user identification and other preprocessing techniques, and then applies the parallelized IP-IAP clustering algorithm to Web log mining. The mining results help the website allocate server resources reasonably and scale servers for different business modules at different times. |
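
For orientation, the iterative process that item (2) parallelizes can be sketched with the standard AP clustering algorithm. The block below is a minimal sketch, not the thesis's IP-AP or IP-IAP method: it uses the baseline median preference, and the abstract does not give the improved preference calculation, so that part is not reproduced. The function name `affinity_propagation`, its parameters, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def affinity_propagation(S, preference=None, damping=0.7, max_iter=300):
    """Standard AP message passing on a similarity matrix S (n x n, n >= 2).

    Hypothetical helper for illustration. `preference` defaults to the median
    of the off-diagonal similarities (the baseline the abstract criticizes);
    a different value can be passed in its place.
    """
    S = np.array(S, dtype=float)
    n = S.shape[0]
    if preference is None:
        preference = np.median(S[~np.eye(n, dtype=bool)])
    np.fill_diagonal(S, preference)

    R = np.zeros((n, n))  # responsibility ("attraction degree")
    A = np.zeros((n, n))  # availability ("membership degree")
    rows = np.arange(n)

    for _ in range(max_iter):
        # r(i,k) = s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[rows, idx]
        AS[rows, idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, idx] = S[rows, idx] - second
        R = damping * R + (1 - damping) * R_new

        # a(i,k) = min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
        # a(k,k) = sum over i' != k of max(0, r(i',k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new

    # Points whose self-responsibility plus self-availability is positive
    # are taken as cluster centers (exemplars).
    exemplars = np.flatnonzero((A + R).diagonal() > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)
    labels = exemplars[np.argmax(S[:, exemplars], axis=1)]
    labels[exemplars] = exemplars  # each exemplar represents itself
    return exemplars, labels

# Toy data (assumed): two well-separated groups of five 2-D points each.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.5, 0.3],
              [5.0, 5.0], [5.2, 5.1], [5.1, 5.4], [5.4, 5.2], [5.3, 5.6]])
# Similarity = negative squared Euclidean distance, a common choice for AP.
S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
exemplars, labels = affinity_propagation(S)
```

Each entry of `labels` is the index of the exemplar that point is assigned to. The full n × n similarity matrix built here is the cost that the divide-and-conquer partitioning in item (1) is meant to reduce.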