
Research On Large-scale Web Information Extraction And Text Classification

Posted on: 2017-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: P Cao
Full Text: PDF
GTID: 2308330488997108
Subject: Software engineering
Abstract/Summary:
With the rapid spread of Internet technology, the Web has become a global, huge, distributed, and shared information space. How to extract valuable information from Web pages is one of the hot topics in Web-based applications. However, traditional information extraction methods face challenges in big data environments, where their efficiency and accuracy are badly reduced. Classifying text on the Web is also an important problem, and traditional classification methods face similarly serious challenges. In view of this situation, this thesis focuses on Web information extraction and Web text classification in large-scale data environments. The main innovations are as follows.

For Web information extraction, this thesis proposes a novel extraction method for large-scale Web pages based on node properties and visual features. It has three parts: (1) each Web page is converted into a Document Object Model (DOM) tree, and a pruning and fusion algorithm is introduced to simplify the tree; (2) a density property and a vision property are defined for each node in the DOM tree, and Web information is extracted based on these property values; (3) the MapReduce framework is employed to extract information from large-scale Web pages in parallel. Simulation and experimental results demonstrate the effectiveness and feasibility of the proposed method.

For Web text classification, this thesis proposes an efficient classification method with three parts: (1) feature words are selected from a complex network constructed from the Web long text; (2) for long text, a classification method based on kNN and SVM is proposed and then extended into a multi-class classifier by means of a decision tree; (3) for short text, a method is proposed based on a thesaurus selected from the long text. Simulation and experimental results demonstrate the effectiveness and feasibility of the proposed method.

Based on the theory and methods above, this thesis designs a large-scale Web information extraction and Web text classification system and presents its design in detail, including requirements analysis, general design, module design, and the implementation process. Test results show that the system has high stability.
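
The abstract does not define the density property, so the following is only a minimal sketch of one common interpretation: the number of text characters in a node's subtree divided by the number of tags in that subtree, computed after pruning noise tags. The names `Node`, `TreeBuilder`, `densest_block`, and the `PRUNED` tag set are hypothetical and chosen for illustration; they are not taken from the thesis, and the vision property and MapReduce stage are not modelled.

```python
from html.parser import HTMLParser

# Assumed noise tags to prune; the thesis's pruning rules are not given.
PRUNED = {"script", "style", "nav", "footer"}
# Void elements have no closing tag, so they are ignored in the tree.
VOID = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_len = 0   # text characters in this subtree
        self.tag_count = 1  # tags in this subtree, including this node


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree, skipping pruned subtrees."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0  # > 0 while inside a pruned subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self.skip_depth or tag in PRUNED:
            self.skip_depth += 1
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if not self.skip_depth:
            self.current.text_len += len(data.strip())


def aggregate(node):
    """Roll text and tag counts up from children to parents."""
    for child in node.children:
        aggregate(child)
        node.text_len += child.text_len
        node.tag_count += child.tag_count


def density(node):
    """Assumed density property: text characters per tag in the subtree."""
    return node.text_len / node.tag_count


def densest_block(html):
    """Return the node with the highest assumed text density."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    aggregate(builder.root)
    best = None
    stack = list(builder.root.children)
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if best is None or density(node) > density(best):
            best = node
    return best


page = (
    "<html><body><nav><a href='#'>Home</a></nav>"
    "<div><p>Long article text here that is the main content of the page.</p></div>"
    "<div><a>x</a><a>y</a></div></body></html>"
)
print(densest_block(page).tag)
```

Picking the single densest node favours leaf paragraphs; a fuller method would presumably combine density with the vision property and merge neighbouring high-density nodes, which this sketch does not attempt.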
Keywords/Search Tags:Large-scale, Web Information, Information Extraction, Feature Selection, Text Classification