Research On Large-scale Web Information Extraction And Text Classification

Posted on:2017-04-29

Degree:Master

Type:Thesis

Country:China

Candidate:P Cao

GTID:2308330488997108

Subject:Software engineering

Abstract/Summary:

With the rapid spread of Internet technology, the Web has become a huge, global, distributed, and shared information space. Extracting valuable information from Web pages is a hot topic in Web-based applications. However, traditional information extraction methods struggle in big data environments, where both their efficiency and their accuracy drop sharply. Classifying Web text is an equally important problem, and traditional classification methods face similar challenges. In view of this, the thesis focuses on Web information extraction and Web text classification in large-scale data environments. Its main contributions are as follows.

For Web information extraction, the thesis proposes a novel extraction method for large-scale Web pages based on node properties and visual features. It has three parts: (1) each Web page is converted into a Document Object Model (DOM) tree, and a pruning and fusion algorithm simplifies the tree; (2) a density property and a vision property are defined for each node in the DOM tree, and Web information is extracted based on these property values; (3) the MapReduce framework is used to extract information from large-scale Web pages in parallel. Simulation and experimental results demonstrate the effectiveness and feasibility of the proposed method.

For Web text classification, the thesis proposes an efficient classification method, also in three parts: (1) feature words are selected from a complex network built from Web long text; (2) for long text, a method based on kNN and SVM is proposed and then extended into a multi-class classifier organized as a decision tree; (3) for short text, a method is proposed based on a thesaurus selected from the long text. Simulation and experimental results demonstrate the effectiveness and feasibility of the proposed method.

Based on the theory and methods above, the thesis designs a large-scale Web information extraction and Web text classification system and describes its design in detail, including requirements analysis, general design, module design, and implementation. Test results show that the system is highly stable.
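
The abstract does not give the thesis's pruning rules or its definition of node density, so the following is only a minimal sketch of the general idea behind steps (1) and (2) of the extraction method: build a DOM-like tree, prune nodes that carry no visible text, and pick the container with the highest text density. Everything here is an assumption made for illustration: the names `Node`, `TreeBuilder`, `prune`, `main_block`, and `SAMPLE_HTML`, the tag sets, and the density formula (characters of text per tag in the subtree). The vision property and the fusion step are not modeled.

```python
from html.parser import HTMLParser

# Assumed tag sets; the thesis's actual pruning rules are not given in the abstract.
SKIP_TAGS = {"script", "style", "noscript"}            # never hold visible content
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # have no closing tag
CONTAINER_TAGS = {"div", "article", "section", "td"}   # candidate content blocks


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""  # text directly inside this node, whitespace-trimmed


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open node with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def prune(node):
    """Drop skipped tags and subtrees that contain no text at all."""
    kept = []
    for child in node.children:
        if child.tag in SKIP_TAGS:
            continue
        prune(child)
        if child.children or child.text:
            kept.append(child)
    node.children = kept


def annotate(node):
    """Store density = text characters / tag count for every subtree."""
    chars, tags = len(node.text), 1
    for child in node.children:
        c_chars, c_tags = annotate(child)
        chars += c_chars
        tags += c_tags
    node.density = chars / tags
    return chars, tags


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def collect_text(node):
    return " ".join(n.text for n in walk(node) if n.text)


def main_block(html):
    """Return the container node with the highest text density, or None."""
    builder = TreeBuilder()
    builder.feed(html)
    prune(builder.root)
    annotate(builder.root)
    candidates = [n for n in walk(builder.root) if n.tag in CONTAINER_TAGS]
    return max(candidates, key=lambda n: n.density, default=None)


SAMPLE_HTML = """
<html><head><title>T</title><script>var x = 1;</script></head><body>
<div id="nav"><a href="#">Home</a><a href="#">News</a><a href="#">About</a></div>
<div id="main"><p>Large-scale extraction needs a fast way to find the content block.</p>
<p>Link-heavy navigation bars have little text per tag, so they score low.</p></div>
<div id="foot"><a href="#">Contact</a></div>
</body></html>
"""

if __name__ == "__main__":
    block = main_block(SAMPLE_HTML)
    print(block.attrs.get("id"), round(block.density, 2))
    print(collect_text(block))
```

Because `main_block` depends only on a single page's HTML, it could serve as the per-page map function in a MapReduce-style job such as the one step (3) describes; that wiring is not shown here.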

Keywords/Search Tags:

Large-scale, Web Information, Information Extraction, Feature Selection, Text Classification

Related items

1	Based On The Rapid Large-scale Text Hierarchical Classification Problem Of Centralized
2	Research And Improvement Of Feature Selection Algorithm In Text Classification
3	The Research And Implementation Of Chinese Text Classification Based On Feature Selection And LDA
4	The Research Of Feature Selection Method In Text Classification Based On Triple-Play
5	Research And Improvement Of Feature Selection Algorithm In Chinese Text Classification
6	Research On Text Classification Of Web Text Mining
7	Research Of Hail Information Extraction Based On Sina Weibo
8	Research On Text Feature Selection And Classification Algorithms
9	Research Of Feature Selection For Text Classification
10	Research Of Feature Selection For Text Classification