
Study On Key Techniques Of Structured Information Extraction From Traditional Paper Based On Feature

Posted on: 2012-10-11
Degree: Master
Type: Thesis
Country: China
Candidate: J G Chen
Full Text: PDF
GTID: 2248330371963981
Subject: Software engineering
Abstract/Summary:
Scientific papers are an important form of output of scientific and technological research, and an important medium for transforming modern science and technology into practical productive forces. Most scientific papers are currently edited with word-processing software (such as Microsoft Word). Because the Word format is unstructured text, elements of a paper such as the title, author, abstract, keywords, and body text cannot be extracted directly, which makes it difficult to support high-level applications of scientific papers such as structured retrieval, statistical classification, and association analysis.

This thesis focuses on structure extraction from traditional scientific papers. It analyzes the basic structure and format features of traditional scientific papers and learns feature-based extraction rules. It then designs and implements a system for feature-based intelligent structured information extraction from traditional papers, which exports structured text that meets the format requirements of multi-dimensional scientific papers.

The innovations and main research work can be summarized as follows:

1) Analyzed the format features and storage standards of traditional scientific papers published in China Core Journals, studied the storage requirements for structured multi-dimensional scientific papers, and designed and implemented an overall technical framework for structured information extraction from scientific papers, which has good scalability.

2) Proposed a feature-based algorithm for structured information extraction from Word documents. The algorithm consists of three parts. First, example learning: since journals publish papers in different formats, we learn from each journal's example paper, identify the text and format features of the paper's elements in the Word document, and generate extraction rules, which are stored in a rule document library. Second, information extraction: we select the extraction rule corresponding to the journal whose articles are to be extracted, and extract each paper element from the Word document; traditional papers stored in the same journal directory can also be extracted in batch. Finally, multi-dimensional paper generation: a multi-dimensional scientific paper is generated automatically, stored in XML and conforming to the multi-dimensional structure of scientific papers.

3) Designed and implemented XWordExchanger, a feature-based system for structured information extraction from traditional papers. The system integrates information extraction, XML structuring, and machine learning techniques, and is currently running well.
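
The three-part algorithm described in item 2) can be illustrated with a minimal sketch. This is not the XWordExchanger implementation, which the abstract does not detail. It assumes a Word document has already been reduced to paragraphs carrying a few format features; the `Paragraph` model, the feature set (font size, bold, alignment), the exact-match rule, and the XML element names are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class Paragraph:
    """A paragraph reduced to text plus format features (hypothetical model)."""
    text: str
    font_size: float
    bold: bool
    alignment: str  # e.g. "left" or "center"


def features(p):
    # Format signature used as the matching key (an assumed feature set).
    return (p.font_size, p.bold, p.alignment)


def learn_rules(example, labels):
    """Part 1: learn one format signature per labeled element of an example paper."""
    rules = {}
    for p, label in zip(example, labels):
        if label is not None:
            rules.setdefault(label, features(p))
    return rules


def extract(paragraphs, rules):
    """Part 2: assign each paragraph to the first element whose signature it matches."""
    result = {}
    for p in paragraphs:
        for label, signature in rules.items():
            if features(p) == signature:
                result.setdefault(label, []).append(p.text)
                break
    return result


def to_xml(elements):
    """Part 3: serialize the extracted elements as a simple XML record."""
    root = ET.Element("paper")
    for label, texts in elements.items():
        ET.SubElement(root, label).text = " ".join(texts)
    return ET.tostring(root, encoding="unicode")


# Learn rules from one journal's example paper, then apply them to a new paper.
example = [
    Paragraph("A Sample Title", 16.0, True, "center"),
    Paragraph("A. Author", 12.0, False, "center"),
    Paragraph("Body text of the example.", 10.5, False, "left"),
]
rules = learn_rules(example, ["title", "author", "text"])

new_paper = [
    Paragraph("Another Title", 16.0, True, "center"),
    Paragraph("B. Writer", 12.0, False, "center"),
    Paragraph("First paragraph.", 10.5, False, "left"),
    Paragraph("Second paragraph.", 10.5, False, "left"),
]
xml_record = to_xml(extract(new_paper, rules))
print(xml_record)
```

Exact matching on a format signature keeps the sketch short; a practical rule set would likely also use text cues (for example a "Keywords:" prefix) and tolerate small format variations within a journal.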
Keywords/Search Tags: Information Extraction, Traditional Paper, Structured, Feature Rules