
Pattern-Based Information Extraction From HTML Documents

Posted on: 2017-03-26
Degree: Master
Type: Thesis
Institution: University
Candidate: SENG SopheaK
GTID: 2308330485456330
Subject: Software engineering
Abstract/Summary:
The World Wide Web holds a huge amount of unlabeled information spread across different sources in various formats. This presents both great opportunities and great challenges in leveraging such a large amount of unstructured data to build knowledge bases and to extract relevant information. Information extraction (IE) systems serve as the front end and core stage in various natural language processing tasks. Because IE has proved effective in domain-specific tasks, this project focuses on one domain: trademark data extraction. Extraction patterns are designed based on a study of the textual expressions and elements found in the text that appears before and after the target text.

Web documents are mostly written in the Hypertext Markup Language (HTML), which provides no means for semantic description of the content, so the contained information cannot be processed directly. Therefore this system, Pattern-Based IE from HTML Documents, focuses on the logical structure of an HTML document derived from its visual information, which gives it a certain independence from the underlying HTML code and better resistance to changes in the documents. This approach is also well suited to the Tree Matching algorithm applied for data extraction. The project is built in Java, using the WebSphinx API and the Jsoup API for retrieving HTML pages and parsing HTML text.

The experiments showed very high performance across the extraction tasks in general. The system achieved higher accuracy on entities, but extraction may fail when larger text elements are presented as unformatted text mixed with all kinds of characters. It can be concluded that Pattern-Based IE from HTML Documents is capable of delivering trademark data with high accuracy, and that it addresses a real problem in a business context.
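The idea of building extraction patterns from the text that appears before and after a target can be illustrated with a minimal sketch. It is not the thesis's implementation: the class name `ContextPatternExtractor`, the field labels ("Trademark:", "Owner:"), the "|" delimiter, and the sample text are all hypothetical, and it uses only `java.util.regex` on plain text instead of the WebSphinx and Jsoup APIs or a tree matching step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a left/right context pattern over plain text.
// Labels and delimiters are assumptions, not taken from the thesis.
final class ContextPatternExtractor {

    // Build a pattern that captures the shortest text found between
    // a literal left context and a literal right context.
    static Pattern between(String left, String right) {
        return Pattern.compile(
                Pattern.quote(left) + "\\s*(.*?)\\s*" + Pattern.quote(right),
                Pattern.DOTALL);
    }

    // Return every target string enclosed by the given contexts, in order.
    static List<String> extract(String text, String left, String right) {
        List<String> targets = new ArrayList<>();
        Matcher m = between(left, right).matcher(text);
        while (m.find()) {
            targets.add(m.group(1));
        }
        return targets;
    }

    public static void main(String[] args) {
        // Hypothetical text, as it might look after HTML tags are stripped.
        String page = "Trademark: ACME ROCKET | Owner: Acme Corp. | Class: 28 |";
        System.out.println(extract(page, "Trademark:", "|"));
        System.out.println(extract(page, "Owner:", "|"));
    }
}
```

Quoting both contexts with `Pattern.quote` keeps characters such as "|" literal, and the reluctant group `(.*?)` stops at the first right context, so adjacent fields are not merged into one match.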
Keywords/Search Tags: Pattern-Based, Information Extraction, Logical Document Structure, Tree Matching Algorithm