| As society enters the Internet age, people’s access to information has become more diverse, and more and more people rely on the Internet to obtain the information they need. At the same time, the rapid growth in the volume of information has made retrieval harder: faced with a mass of retrieval results, users often cannot find the information they need efficiently and accurately. For this reason, this thesis studies the automatic generation of summaries for Web news. It analyzes the shortcomings of the TextRank algorithm and of the multi-feature combination algorithm, proposes a news summarization algorithm that combines BM25 with text features, and conducts comparative experiments with five different algorithms. Finally, a Web news summarization system using the proposed algorithm is developed on the Heritrix framework. The thesis first introduces the background and significance of the research topic, together with the current state of research and major achievements in automatic text summarization worldwide. It then presents the relevant background on automatic text summarization, including the categories of automatic summaries and the methods for producing them, as well as methods for collecting news pages with a web crawler and for extracting web page content. Chapter 3 first introduces the main idea of the cx-extractor algorithm and its advantages over traditional methods. It then analyzes a shortcoming of TextRank, namely that it considers only the internal structure of the text when scoring sentences, and finds that the way TextRank calculates sentence similarity is unreliable. On this basis, a novel news summarization algorithm combining BM25 with text features is proposed. The algorithm is further improved to address two problems: BM25 may produce negative values, and BM25 scores may become meaningless when a sentence is too long. Chapter 4 uses the ROUGE evaluation tool to compare the proposed algorithm with other related algorithms. The experimental results show that the proposed algorithm outperforms the other methods. Finally, to put the proposed algorithm into practice, the thesis designs and implements an automatic Web news summarization system based on the Heritrix framework. The system includes modules for web page collection, text extraction, graph model representation, and sentence weight calculation. It collects news web pages in real time, automatically extracts a summary from each page, and displays the summaries through HTML pages. |
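The abstract does not give the formulas behind the BM25 improvement, so the following is only a minimal sketch of the negative-value problem it mentions, not the thesis's actual method. It assumes the classic Robertson-Sparck Jones IDF, treats each sentence as a "document", and clamps the IDF at zero as one possible remedy. The function names (`classic_idf`, `floored_idf`, `bm25_similarity`), the parameters `k1` and `b`, and the toy sentences are all hypothetical.

```python
import math
from collections import Counter


def classic_idf(n_sentences, sent_freq):
    # Classic BM25 IDF. It turns negative once a term occurs in more
    # than half of the sentences (sent_freq > n_sentences / 2).
    return math.log((n_sentences - sent_freq + 0.5) / (sent_freq + 0.5))


def floored_idf(n_sentences, sent_freq, floor=0.0):
    # Assumed remedy (not taken from the thesis): clamp the IDF at a floor
    # so that very common terms never subtract from the similarity.
    return max(classic_idf(n_sentences, sent_freq), floor)


def bm25_similarity(query, target, sentences, idf=floored_idf, k1=1.5, b=0.75):
    """BM25 score of tokenized sentence `target` for the terms of `query`."""
    n = len(sentences)
    avg_len = sum(len(s) for s in sentences) / n
    term_freq = Counter(target)
    score = 0.0
    for term in set(query):
        if term not in term_freq:
            continue
        sent_freq = sum(1 for s in sentences if term in s)
        tf = term_freq[term]
        # Standard BM25 length normalization against the average sentence length.
        norm = tf + k1 * (1 - b + b * len(target) / avg_len)
        score += idf(n, sent_freq) * tf * (k1 + 1) / norm
    return score


# Toy, pre-tokenized sentences; "news" occurs in every sentence.
sentences = [
    ["news", "summary", "bm25"],
    ["news", "web", "crawler"],
    ["news", "textrank", "graph"],
]

raw = bm25_similarity(sentences[0], sentences[1], sentences, idf=classic_idf)
fixed = bm25_similarity(sentences[0], sentences[1], sentences, idf=floored_idf)
```

In this toy input the only shared term is "news", which appears in all three sentences, so `raw` is negative while `fixed` is clamped to zero. A negative edge weight of this kind would be awkward in a TextRank-style graph, which is one plausible reason the thesis corrects for it.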