Research And Implementation Of News Retrieval System

Posted on: 2019-05-13
Degree: Master
Type: Thesis
Country: China
Candidate: G Zhang
Full Text: PDF
GTID: 2428330545464772
Subject: Software engineering
Abstract/Summary:
With advances in Internet and news technology and the arrival of the big data era, the Internet has become the mainstream medium of news communication. It spreads information rapidly and from many angles, through many channels and media and with strong interactivity, and it has become the main way the general public gets news. With network media technology highly developed, every user can be a news publisher. The number of Internet users rises year by year, and the growing user base has sharply increased the volume of news data. Because the network is now the most convenient and most important way for people to obtain news and information, users want to find news about significant events accurately and quickly, and to get nearby hot news in real time. Designing and studying a news retrieval system that meets these needs can better serve the socialist information construction and contribute to the national big data strategy.

This thesis studies and implements a news retrieval system that uses HDFS as the news storage medium. It first studies the features of web pages on major news websites and examines the principles and techniques of various search engines. Solr and similar search engine technologies generate a cumbersome index directory structure when building an index, which does not suit the read and write characteristics of HDFS. The system is therefore developed on the Hadoop platform. Its main functions are acquisition of website file data, index management, and news search. HDFS, a distributed file system, stores the news crawled from websites. The MapReduce distributed programming model is combined with the idea of the inverted index to build an inverted index over the filtered news, and on this basis the system provides accurate and efficient news search. The system designs a directory structure for storing the inverted index and related information, based on the directory structure of news storage and the characteristics of HDFS. When the inverted index is built, keyword extraction and abstract generation algorithms are used to capture the characteristics of web pages. An improved KMP algorithm raises the efficiency of keyword extraction and abstract generation, and the MapReduce model makes index construction fast and accurate. To improve retrieval efficiency, a top-k news retrieval algorithm is proposed that combines the BFPRT algorithm with the MapReduce model. A highlighting algorithm based on the improved KMP algorithm quickly processes and highlights the search terms in the summaries and titles of the results. Implementation and testing show that the system meets the functional and performance requirements of a news retrieval system, and that it performed stably and efficiently in actual use.
Keywords/Search Tags: News Retrieval, Hadoop, BFPRT, Keyword Extraction, Improved KMP Algorithm
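
The abstract names two techniques that are easy to illustrate in a few lines: building an inverted index in a map/reduce style, and selecting the top-k results with BFPRT (median-of-medians selection). The following is a minimal single-process sketch in Python, not the thesis's Hadoop implementation. All function names (`map_phase`, `reduce_phase`, `build_index`, `select`, `top_k`) are hypothetical, and ranking documents by raw term count is an assumption made only for illustration, since the abstract does not state the scoring rule.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit (term, doc_id) pairs, as a mapper would for one news document."""
    for term in text.lower().split():
        yield term, doc_id


def reduce_phase(pairs):
    """Group pairs by term and count occurrences per document."""
    index = defaultdict(lambda: defaultdict(int))
    for term, doc_id in pairs:
        index[term][doc_id] += 1
    return index


def build_index(docs):
    """Run the map and reduce steps in-process over {doc_id: text}."""
    pairs = (pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text))
    return reduce_phase(pairs)


def select(items, k, key):
    """Return the item with the k-th smallest key (0-based).

    Uses the BFPRT median-of-medians pivot, which gives linear
    worst-case running time.
    """
    if len(items) <= 5:
        return sorted(items, key=key)[k]
    groups = [sorted(items[i:i + 5], key=key)
              for i in range(0, len(items), 5)]
    medians = [g[len(g) // 2] for g in groups]
    pivot = key(select(medians, len(medians) // 2, key))
    lows = [x for x in items if key(x) < pivot]
    pivots = [x for x in items if key(x) == pivot]
    highs = [x for x in items if key(x) > pivot]
    if k < len(lows):
        return select(lows, k, key)
    if k < len(lows) + len(pivots):
        return pivots[0]
    return select(highs, k - len(lows) - len(pivots), key)


def top_k(postings, k):
    """Return the k (doc_id, count) pairs with the highest counts, best first.

    Assumption: score = term count; ties are broken by doc_id so that
    every key is unique and exactly k items pass the threshold.
    """
    items = list(postings.items())
    if k <= 0:
        return []

    def rank(pair):
        return (-pair[1], pair[0])

    if k >= len(items):
        return sorted(items, key=rank)
    threshold = rank(select(items, k - 1, rank))
    return sorted((p for p in items if rank(p) <= threshold), key=rank)


docs = {
    "n1": "flood hits city flood",
    "n2": "city council vote",
    "n3": "flood warning flood flood",
}
index = build_index(docs)
best = top_k(index["flood"], 1)
```

In this sketch, selection finds the k-th best score without sorting the full postings list, and only the k survivors are sorted for display. In a distributed setting, one plausible arrangement is for each worker to compute a local top-k and for a final step to merge them, but the abstract does not describe how the thesis divides this work.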