Design And Implementation Of Real-time Recommendation System Based On Spark

Posted on: 2020-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: X Yu
GTID: 2428330626450671
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, the volume of information carried by networks is growing explosively. Faced with this mass of information, people often find it difficult to locate the content they want and are interested in. Search engines were created to solve this problem: users with a clear goal can find what they want quickly and accurately by keyword. In daily life, however, the needs of most users are vague and latent. Recommendation systems were created to mine these latent interests. A recommendation system is a tool for dealing with information overload. It helps users find information they may be interested in and greatly reduces the time they spend looking for it, which in turn increases the stickiness of a website.

The performance of a recommendation system is determined chiefly by its architecture and its recommendation algorithm. A traditional recommendation system built on Hadoop, an offline batch computing platform, can produce more accurate recommendations by processing massive data, but its computation time is too long to meet real-time requirements. A real-time recommendation system based on the Lambda architecture combines an offline batch processing layer with an online real-time layer. Although this design offers accurate computation and high fault tolerance, it still has problems: the results of the two layers are difficult to merge, the system is complex, and it is hard to maintain. In addition, the recommendation results depend on the offline layer. If user behavior changes greatly in a short time, the offline results lag behind, and the recommendations cannot reflect the change in the user's interests in time.

On the algorithm side, commonly used recommendation algorithms such as collaborative filtering were originally proposed for recommendation in an offline environment. Each recommendation computes the similarity of items or users over the whole rating matrix. When the dimension of the matrix is very large, the computation becomes expensive and real-time requirements are hard to meet. In a data stream environment, the rating matrix changes frequently, so user similarity and item similarity change frequently as well. How to update recommendation results in real time while minimizing unnecessary computation has therefore become an important problem for recommendation algorithms.

To address these problems, and building on an in-depth study of recommendation system architecture and recommendation algorithms, this thesis designs and implements a complete online computing layer based on the micro-batch stream processing capability of Spark Streaming. Compared with the Lambda architecture, the system does not rely on an offline layer. This greatly reduces system complexity, removes the problem of merging results, and improves the real-time performance of recommendation. The work of this thesis is as follows.

First, the requirements of a real-time recommendation system are analyzed in detail, and the architectures available for real-time recommendation are examined and compared. On this basis, a real-time recommendation architecture consisting entirely of online layers is proposed, built on Spark, Kafka, and HBase. It uses the Kafka message queue as the data buffering module to absorb the instability of real-time data streams in practical scenarios. It uses HBase, a column-oriented database that supports random access, as the storage module to meet the read and write performance required during data processing. It uses the micro-batch stream processing capability of Spark Streaming to process the stream data in real time, meeting the throughput and latency requirements of a real-time recommendation system.

Second, the problems of collaborative filtering in a data stream environment are studied in depth. Based on the basic idea of collaborative filtering, a recommendation algorithm that filters the Spark Streaming data stream is proposed. The algorithm uses the Hoeffding bound to filter out the part of the stream data that has little influence on the results, and it incrementally updates the recommendation results using an item similarity measure based on the consistency of users' positive and negative ratings. As a result, recommendation results can be updated with a delay of seconds.

Finally, based on this research on architecture and algorithm, a real-time recommendation system based on Spark is implemented. After the development environment is set up and deployed, the system is tested on the MovieLens dataset. The results show that the system updates recommendations while maintaining accuracy and meets the requirements of real-time recommendation.
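
The filtering and incremental update idea described in the abstract can be illustrated with a small Python sketch. This is not the thesis's implementation: the abstract gives no formulas, so the class name `IncrementalItemSimilarity`, the parameters `like_threshold`, `delta`, and `epsilon_target`, the agreement-ratio similarity, and the rule "skip a pair once its Hoeffding bound is tight enough" are all assumptions made for illustration. The sketch also runs on plain in-memory events rather than on Spark Streaming.

```python
import math
from collections import defaultdict


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: with probability at least 1 - delta, the mean of n
    independent observations spanning value_range lies within the returned
    epsilon of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


class IncrementalItemSimilarity:
    """Hypothetical sketch: item similarity as the fraction of co-rating
    users whose ratings agree in sign (both positive or both negative)."""

    def __init__(self, like_threshold=3.0, delta=0.05, epsilon_target=0.1):
        self.like_threshold = like_threshold  # assumed cut-off for "positive"
        self.delta = delta                    # assumed confidence parameter
        self.epsilon_target = epsilon_target  # assumed precision target
        self.user_ratings = defaultdict(dict)  # user -> {item: liked?}
        self.co_count = defaultdict(int)       # (item_a, item_b) -> co-ratings
        self.agree_count = defaultdict(int)    # (item_a, item_b) -> agreements

    def _is_stable(self, pair):
        # A pair is treated as stable once its Hoeffding bound is small
        # enough; the agreement indicator lies in [0, 1], so the range is 1.
        n = self.co_count.get(pair, 0)
        return n > 0 and hoeffding_bound(1.0, self.delta, n) <= self.epsilon_target

    def update(self, user, item, rating):
        """Process one rating event; return how many item pairs were updated."""
        liked = rating >= self.like_threshold
        seen = self.user_ratings[user]
        if item in seen:
            return 0  # re-ratings are ignored in this sketch
        updated = 0
        for other, other_liked in seen.items():
            pair = (min(item, other), max(item, other))
            if self._is_stable(pair):
                continue  # filtered: more data would barely move the estimate
            self.co_count[pair] += 1
            if liked == other_liked:
                self.agree_count[pair] += 1
            updated += 1
        seen[item] = liked
        return updated

    def similarity(self, item_a, item_b):
        pair = (min(item_a, item_b), max(item_a, item_b))
        n = self.co_count.get(pair, 0)
        return self.agree_count.get(pair, 0) / n if n else 0.0
```

Each event touches only the pairs formed with items the same user has already rated, so the cost per event does not depend on the size of the full rating matrix. In a Spark Streaming job, logic of this kind would be applied to each micro-batch, with the counters kept in an external store such as HBase; how the thesis actually partitions and stores this state is not stated in the abstract.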
Keywords/Search Tags: Spark, Spark Streaming, Collaborative filtering, Real-time recommendation