
Research And Implementation Of Spark Real-Time Recommendation System

Posted on: 2022-10-09  Degree: Master  Type: Thesis
Country: China  Candidate: F Zhang  Full Text: PDF
GTID: 2518306722972969  Subject: Master of Engineering
Abstract/Summary:
With the development of 5G and early planning for 6G, network data is growing explosively. From "Internet Plus" to the smart city, continuous innovation in science and technology has brought great convenience to people's lives. However, as the amount of data grows, it becomes increasingly difficult for people to find the information they need. A recommendation system can help people quickly find the content and information they are interested in. Collecting effective information efficiently and accurately is very important to a recommendation system, and its efficiency and accuracy depend on the system architecture and the recommendation algorithm.

Early on, Hadoop frameworks could meet users' requirements for mass data storage and for the accuracy of offline training, but as data volumes grow, the processing speed of Hadoop's MapReduce drops significantly, which makes it difficult to meet the real-time requirements of recommendation. Spark, a big data processing and analysis engine, helps address the problem of slow disk reads and writes. Current real-time data processing frameworks built on the Lambda architecture offer high stability and fault tolerance and can separate real-time computation from offline prediction. However, with massive data, data collection becomes more and more difficult and a large number of intermediate files are produced, which greatly increases server storage pressure. In addition, when user behavior changes sharply in a short period of time, recommendation accuracy drops considerably.

Among existing recommendation algorithms, the typical collaborative filtering algorithm mainly targets recommendation prediction in the offline setting. Although its offline prediction accuracy is high, the similarity matrix must be rebuilt when user preferences change, and the recalculation time grows greatly. Current recommendation algorithms can recommend different information for different time periods, but against the background of the COVID-19 pandemic, the recommendation results of Dianping do not consider whether the recommended information conforms to the current epidemic prevention and control criteria. Therefore, how to better realize real-time recommendation and optimize the recommendation results has become an important problem for current recommendation systems.

To address these problems, this thesis studies the current mainstream recommendation algorithms and recommendation system architectures in depth, and on this basis draws on the Spark ecosystem to design and implement a real-time recommendation system based on Spark Streaming. First, a distributed crawler obtains data from Dianping, and Canal is used to monitor the MySQL logs. Kafka message queues are built to consume real-time application data. Next, the results of the real-time calculations are stored in the MySQL database and the Elasticsearch index is synchronized. Finally, building on the Lambda and Kappa architectures, the real-time recommendation system architecture is optimized to improve the accuracy and real-time performance of recommendations.

The specific work of this thesis includes the following points:

(1) To improve the efficiency of data acquisition, a distributed web crawler based on Docker containers is designed and implemented, and the operating efficiency of the distributed crawler in the Docker container environment and in a VM environment is compared.

(2) The functional requirements of a real-time recommendation system are analyzed in detail, the advantages and disadvantages of different real-time recommendation system architectures are compared, and a real-time recommendation system based on Spark is constructed. The system first uses the Docker-based distributed crawler to obtain the Dianping data. Second, a Kafka message queue consumes the crawler data and serves as the cache module for the real-time data stream. Finally, Spark Streaming performs the real-time calculation needed for real-time recommendation. MySQL is used for data storage and random access, combined with Redis as a data cache, which improves system performance. In addition, the efficiency of the crawler is improved by a Redis-based de-duplication mechanism.

(3) To account for the impact of epidemic factors on recommendation results, ELK-related technologies are used, and recall strategies are tested and adjusted based on the Elasticsearch search engine and the actual epidemic criteria, so as to optimize the real-time recommendation results. The final recommendation results are displayed in the Web front end.

(4) The online learning algorithm FTRL (Follow The Regularized Leader) is studied to mitigate the data sparsity and cold-start problems, and functional and performance tests of real-time recommendation are conducted on the data set acquired by the crawler. The system achieves the expected design goals of a real-time recommendation system.
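The abstract names FTRL as the online learning algorithm but does not describe the variant used in the thesis. As an illustration only, the following is a minimal sketch of the commonly used per-coordinate FTRL-Proximal update for logistic regression on sparse features. The class name, the hyperparameter values, and the feature keys such as "user:1|item:a" are assumptions made for this example and do not come from the thesis.

```python
import math
from collections import defaultdict


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression (illustrative sketch)."""

    def __init__(self, alpha=0.1, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # accumulated adjusted gradients per feature
        self.n = defaultdict(float)  # accumulated squared gradients per feature

    def _weight(self, i):
        # Closed-form weight; L1 regularization zeroes out weak features.
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        denom = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def predict(self, x):
        # x is a sparse feature dict: feature key -> value.
        s = sum(self._weight(i) * v for i, v in x.items())
        s = max(min(s, 35.0), -35.0)  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, x, y):
        # One online step on a single example with label y in {0, 1}.
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v  # gradient of log loss for this coordinate
            w = self._weight(i)
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w
            self.n[i] += g * g
        return p
```

In a streaming setting, `update` would be called once per incoming interaction event, so the model adapts without rebuilding a similarity matrix. With `l1 > 0`, features with little accumulated evidence keep a weight of exactly zero, which is one reason this family of methods is often chosen for sparse data.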
Keywords/Search Tags: Big Data, Docker, Distributed Crawler, Spark, Real-time recommendation