Pattern-Based Information Extraction From HTML Documents

Posted on:2017-03-26

Degree:Master

Type:Thesis

Institution:University

Candidate:SENG SopheaK

Full Text:PDF

GTID:2308330485456330

Subject:Software engineering

Abstract/Summary:

The World Wide Web is a huge source of unlabeled information spread across different sources in various formats. This presents both great opportunities and great challenges in leveraging such a large amount of unstructured data to build knowledge bases and to extract relevant information. Information extraction (IE) systems serve as the front end and core stage in different natural language processing tasks. As IE has proved its efficiency in domain-specific tasks, this project focuses on one domain: trademark data extraction. Extraction patterns are then designed based on a study of the textual expressions and elements found in the text that appears before and after the target text.

Web documents are mostly written in the Hypertext Markup Language (HTML), which doesn't contain any means for semantic description of the content, so the contained information cannot be processed directly. Therefore, this system, Pattern-Based IE from HTML Documents, focuses on the logical structure of an HTML document derived from its visual information, which gives it a degree of independence from the underlying HTML code and better resistance to changes in the documents. Moreover, this structure is well suited to the tree matching algorithm applied for data extraction. The project is built in Java, using the WebSphinx and Jsoup APIs for retrieving HTML pages and parsing HTML text.

In the experiments, the system achieved very high performance across the extraction tasks in general. It performed with higher accuracy on entities, yet extraction may fail when larger text elements are presented as unformatted text mixed with all kinds of characters.

It can be concluded that Pattern-Based IE from HTML Documents is capable of delivering trademark data with high accuracy and that it addresses a real problem in a business context.
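The idea of designing extraction patterns from the text that appears before and after the target text can be illustrated with a minimal Java sketch. It is not the thesis's implementation: the class name `ContextPatternExtractor`, the nested `ContextPattern` type, the field labels, and the sample record are all hypothetical. It also assumes the page text has already been pulled out of the HTML, and uses only `java.util.regex` from the standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: extract a target value by the literal text around it.
class ContextPatternExtractor {

    /** A pattern defined by the text expected before and after the target. */
    static final class ContextPattern {
        final String field;   // illustrative field label, e.g. "owner"
        final Pattern regex;

        ContextPattern(String field, String before, String after) {
            this.field = field;
            // Quote both contexts so they match literally, and capture the
            // shortest run of text between them (surrounding spaces trimmed).
            this.regex = Pattern.compile(
                    Pattern.quote(before) + "\\s*(.+?)\\s*" + Pattern.quote(after),
                    Pattern.DOTALL);
        }
    }

    /** Returns every value in the text that sits between the two contexts. */
    static List<String> extract(String text, ContextPattern pattern) {
        List<String> values = new ArrayList<>();
        Matcher m = pattern.regex.matcher(text);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for content taken from a page.
        String text = "Trademark: ACME ROCKET | Owner: Acme Corp. | Class: 12 |";

        ContextPattern name = new ContextPattern("trademark", "Trademark:", "|");
        ContextPattern owner = new ContextPattern("owner", "Owner:", "|");

        System.out.println(name.field + " = " + extract(text, name));
        System.out.println(owner.field + " = " + extract(text, owner));
    }
}
```

Literal contexts like these are brittle against unformatted text, which is consistent with the limitation the abstract reports for larger text elements.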

Keywords/Search Tags:

Pattern-Based, Information Extraction, Logical Document Structure, Tree Matching Algorithm

Related items

1	Pattern Extraction And Registration Of Formatted Document Image
2	Information Extraction System For Three Types Of Information Disclosure Announcements Of Listed Companies
3	Research On Structured Information Extraction Based On Pattern Matching
4	The Research On Logical Structure Analysis Of Document Image
5	Research On Techniques Of Automatic Data Record Analysis And Recognition For Accurate Web Information Extraction
6	Research On Condensed Sequential Pattern Mining Based On Tree Structure
7	Research And Application Of Web Information Extraction And Webpage Summarization
8	The Research Of Multi-pattern Matching Algorithm Based On Sequential Binary Tree
9	The Research And Improvement Of XML Tree Pattern Matching Query Algorithm
10	Key Technique Of Open Document Isomorphic Engine