
Design And Implementation Of Distributed-based News Crawler And Recommendation System

Posted on: 2019-09-01 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Zhang | Full Text: PDF
GTID: 2428330545953697 | Subject: Computer science and technology
Abstract/Summary:
The Distributed-based News Crawler and Recommendation System is an important component of "Wisdom Academy", an internal project of the Shandong Academy of Sciences. The Wisdom Academy project aims to research and develop various popular technologies, promote scientific research cooperation among different teams and institutes, and improve talent screening and comprehensive evaluation by using advanced data processing technology. Acting as a scientific research assistant, a cooperation facilitator and a talent hunter, the project supports the research, services and talent of the Academy of Sciences, and promotes the fine-grained management and business upgrading of the Academy.

With the continuous development of machine learning technology and the intensification of market competition, news recommendation applications in domestic and foreign markets are becoming increasingly mature and stable. Nevertheless, in general news reading and recommendation applications, the application itself decides the news sources, and users cannot customize them further. The system described in this thesis allows users to customize the news pages they are interested in. The backend of the system handles user requests, such as adding a news source of interest to the crawler's data sources, which makes it easy for users to reach the news they care about accurately. The system also provides information retrieval and personalized recommendation services, further reducing the difficulty of accessing information.

The main work of this thesis is to provide accurate information retrieval and information push services for leaders at all levels, departments and scientific researchers of the Shandong Academy of Sciences, through a web application, e-mail and other means. Based on each user's definition of a particular site (including address, keywords and push time), a web crawler crawls the corresponding information. Relying on big data platform technology, the project builds a user behavior log collection and analysis system in order to model user behavior. At the same time, the Spark distributed computing framework is used to mine and analyze the news data crawled from the Internet. Finally, the project builds a personalized news recommendation system from the news model and the user behavior model.

The project combines an Internet news crawler, a search engine, machine learning, data mining, log collection and analysis, a recommendation system and other technologies. Because of this complexity, the whole system is divided into five subsystems. The news crawler system uses the Nutch distributed crawler to crawl Internet news data; the Nutch source code is modified to parse the fields of the news data accurately, write the segmented Chinese words into HBase, and create a news corpus. The news retrieval system uses the distributed open source search engine Solr to provide a search API that meets users' information retrieval requirements. The news feature learning system uses MLlib, the machine learning library of the Spark distributed computing platform, to preprocess and model the news data in the corpus; news features are represented with the LDA topic model. The user feature modeling system collects user behavior logs with JavaScript, sends them to the backend over HTTP, and processes them in real time with Spark Streaming in order to model user behavior. The news recommendation system uses the news modeling results and the user modeling results to calculate a user's preference score for each candidate news item, recommends news according to these scores, and returns the recommended results to users through a RESTful API developed with the Spring framework.

At the time this thesis was completed, the system had been successfully developed and was in use at the Shandong Computer Science Center. The scheme proposed in this thesis has some reference value for text recommendation systems.
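
The abstract does not state how the preference score is computed. As a rough illustration only, the sketch below assumes that each news item is represented by an LDA topic distribution, that a user profile is the average topic distribution of the news the user has read, and that the score is the cosine similarity between the two. The function names, the averaging step and the choice of cosine similarity are all assumptions made for this example, not details taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_user_profile(read_topic_vectors):
    """Hypothetical user model: mean of the topic vectors of news the user read."""
    n = len(read_topic_vectors)
    if n == 0:
        return []
    k = len(read_topic_vectors[0])
    return [sum(vec[i] for vec in read_topic_vectors) / n for i in range(k)]


def recommend(user_profile, candidates, top_n=2):
    """Score every candidate against the profile and return the top_n (id, score) pairs."""
    scored = [(news_id, cosine(user_profile, vec)) for news_id, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy topic distributions over three topics (illustrative values, not real data).
profile = build_user_profile([[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]])
candidates = {
    "news_a": [0.9, 0.05, 0.05],
    "news_b": [0.1, 0.8, 0.1],
    "news_c": [0.1, 0.1, 0.8],
}
ranking = recommend(profile, candidates, top_n=2)
print([news_id for news_id, _ in ranking])
```

In a deployment like the one described, the topic vectors would come from the Spark MLlib LDA model and the read history from the log pipeline; the toy values here only show the shape of the computation.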
Keywords/Search Tags: Distributed Crawler, News Recommender System, LDA Model, Log System