Research And Application Of Text Abstraction Technology On The Web

Posted on: 2008-07-16
Degree: Master
Type: Thesis
Country: China
Candidate: L Z Cui
GTID: 2178360215973888
Subject: Computer application technology
Abstract/Summary:
As the amount of online information accessible to users grows exponentially, traditional techniques for processing and managing text data can no longer satisfy users' varied demands. The need for an efficient way to find useful information has revived research on, and application of, automatic abstraction in recent years. This thesis presents a method of automatic abstraction based on the structure of Web text and on mechanical extraction. It examines in depth several key technologies, including analysis of the text structure, keyword extraction, and abstract generation. The resulting abstracts are then clustered, and the experimental results and their analysis are given. The main points of this thesis are:

(1) The definition and classification of Web mining are summarized first, together with the background, basic concepts, and applications of Web text mining. The development history, styles, technologies, and evaluation of automatic text abstraction are also elaborated. Automatic Web abstraction is the central topic of this thesis.

(2) Methods for analyzing Web text with Java regular expressions, based on a thorough analysis of the structure of Web text, are described, including the core implementation process.

(3) Automatic abstraction based on the Extended TF-IDF technique is studied. Its main steps are filtering out stop words, computing the weights of terms and sentences, and ranking the sentences by weight. The sentences with the highest weights are taken as the summary sentences and are output in the order in which they appear in the original text. Intrinsic evaluation indicates that the TF-IDF technique is more effective.

(4) Many kinds of clustering methods are described, and their merits and defects are compared and summarized. In addition, the process of automatic abstraction is given, and the abstracts described above are clustered with the classic k-means method. Comparing these results with those obtained by clustering the Web texts directly shows that the abstraction technique put forward in this thesis makes clustering more efficient.
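The extraction steps in point (3) can be illustrated with a minimal Java sketch. It uses plain TF-IDF with each sentence treated as a document, not the thesis's Extended TF-IDF, whose details are not given here. The class name `ExtractiveSummary`, the tiny stop-word list, and the English-only tokenizer are all hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ExtractiveSummary {
    // Hypothetical stop-word list; a real system would load a much fuller one.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "of", "and", "is", "are", "to", "in", "on", "it"));

    // Assumption: lowercase ASCII tokenization, which only suits English text.
    static List<String> tokenize(String sentence) {
        List<String> terms = new ArrayList<>();
        for (String t : sentence.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Returns the k highest-weighted sentences, in their original order.
    static List<String> summarize(String text, int k) {
        // Naive sentence split: whitespace that follows '.', '!' or '?'.
        String[] sentences = text.trim().split("(?<=[.!?])\\s+");
        int n = sentences.length;

        // Step 1: filter stop words and count, per term, how many sentences contain it.
        List<List<String>> tokens = new ArrayList<>();
        Map<String, Integer> sentenceFreq = new HashMap<>();
        for (String s : sentences) {
            List<String> terms = tokenize(s);
            tokens.add(terms);
            for (String t : new HashSet<>(terms)) {
                sentenceFreq.merge(t, 1, Integer::sum);
            }
        }

        // Step 2: sentence weight = sum over its terms of tf * log(n / sentenceFreq).
        double[] score = new double[n];
        for (int i = 0; i < n; i++) {
            Map<String, Integer> tf = new HashMap<>();
            for (String t : tokens.get(i)) {
                tf.merge(t, 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / sentenceFreq.get(e.getKey()));
                score[i] += e.getValue() * idf;
            }
        }

        // Step 3: rank sentence indices by descending weight (stable for ties).
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingDouble((Integer i) -> -score[i]));

        // Step 4: keep the top k and restore original text order.
        List<Integer> top = new ArrayList<>(order.subList(0, Math.min(k, n)));
        Collections.sort(top);
        List<String> summary = new ArrayList<>();
        for (int i : top) {
            summary.add(sentences[i]);
        }
        return summary;
    }

    public static void main(String[] args) {
        String text = "Cats sleep. Cats eat fish and drink milk daily. Cats play.";
        System.out.println(summarize(text, 2));
    }
}
```

Because the weight is a raw sum over terms, longer sentences tend to score higher; dividing by sentence length is one common adjustment, though whether the thesis does so is not stated.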
Keywords/Search Tags: Web Text Abstract, Regular Expression, TF-IDF, Web Text Cluster, k-means