Study On Key Techniques Of Structured Information Extraction From Traditional Paper Based On Feature

Posted on:2012-10-11

Degree:Master

Type:Thesis

Country:China

Candidate:J G Chen

Full Text:PDF

GTID:2248330371963981

Subject:Software engineering

Abstract/Summary:

Scientific papers are an important output of scientific and technological research and an important medium for turning modern science and technology into practical productive forces. Most scientific papers are currently edited with word-processing software such as Microsoft Word. Because the Word format is unstructured text, elements such as the title, authors, abstract, keywords, and body text cannot be extracted directly, which makes it difficult to support high-level applications of scientific papers such as structured retrieval, statistical classification, and association analysis. This thesis focuses on structure extraction from traditional scientific papers. It analyzes the basic structure and format features of traditional scientific papers and learns feature-based extraction rules. It then designs and implements a system for feature-based intelligent structured information extraction, which exports structured text that meets the format requirements of multi-dimensional scientific papers from traditional papers.

The innovations and main research work can be summarized as follows:

1) Analyzes the format features and storage standards of traditional scientific papers published in China Core Journals, studies the storage requirements of structured multi-dimensional scientific papers, and designs and implements an overall technical framework for structured information extraction from scientific papers that has good scalability.

2) Proposes a feature-based algorithm for structured information extraction from Word documents.
The algorithm consists of three parts. First, example learning: because journals publish papers in different formats, the system learns from an example paper of each journal, identifies the text and format features of each paper element in the Word document, and generates extraction rules that are stored in a rule document library. Second, information extraction: the extraction rule corresponding to the journal of the papers to be processed is selected, and each paper element is extracted from the Word document. Traditional papers stored in the same journal directory can also be extracted in batch. Finally, multi-dimensional paper generation: a multi-dimensional scientific paper is generated automatically in an XML-based storage format that conforms to the multi-dimensional structure of scientific papers.

3) Designs and implements XWordExchanger, a feature-based system for structured information extraction from traditional papers. The system integrates information extraction, XML structuring, and machine learning techniques, and is currently running well.
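The three-part algorithm described in item 2 can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not give the rule format or feature set, so the `Paragraph` class, the `(font_size, bold, align)` signature, the functions `learn_rules`, `extract`, and `to_xml`, and the XML element names are all assumptions made for illustration. Reading paragraphs out of an actual Word file is omitted; the sketch starts from paragraphs whose format features are already known.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class Paragraph:
    """One paragraph with a few format features (hypothetical feature set)."""
    text: str
    font_size: float
    bold: bool
    align: str  # e.g. "left" or "center"


# Assumed rule format: a format signature mapped to an element label.
Signature = Tuple[float, bool, str]


def signature(p: Paragraph) -> Signature:
    return (p.font_size, p.bold, p.align)


def learn_rules(example: List[Tuple[str, Paragraph]]) -> Dict[Signature, str]:
    """Part 1, example learning: record which label each signature carried
    in one labeled example paper of a journal."""
    rules: Dict[Signature, str] = {}
    for label, paragraph in example:
        rules.setdefault(signature(paragraph), label)
    return rules


def extract(paragraphs: List[Paragraph],
            rules: Dict[Signature, str]) -> Dict[str, List[str]]:
    """Part 2, information extraction: label each paragraph of a new paper
    by looking up its signature; unmatched paragraphs are skipped."""
    elements: Dict[str, List[str]] = {}
    for paragraph in paragraphs:
        label = rules.get(signature(paragraph))
        if label is not None:
            elements.setdefault(label, []).append(paragraph.text)
    return elements


def to_xml(elements: Dict[str, List[str]]) -> str:
    """Part 3, generation: serialize the extracted elements as XML."""
    root = ET.Element("paper")
    for label, texts in elements.items():
        for text in texts:
            ET.SubElement(root, label).text = text
    return ET.tostring(root, encoding="unicode")


# Labeled example paper for one hypothetical journal.
example = [
    ("title", Paragraph("A Sample Title", 16.0, True, "center")),
    ("author", Paragraph("A. Author", 12.0, False, "center")),
    ("abstract", Paragraph("Abstract: sample.", 9.0, False, "left")),
    ("body", Paragraph("Body text.", 10.5, False, "left")),
]
rules = learn_rules(example)

# A new paper from the same journal, assumed to share its formatting.
new_paper = [
    Paragraph("Another Title", 16.0, True, "center"),
    Paragraph("B. Writer", 12.0, False, "center"),
    Paragraph("Abstract: other.", 9.0, False, "left"),
    Paragraph("First paragraph.", 10.5, False, "left"),
    Paragraph("Second paragraph.", 10.5, False, "left"),
]
xml_text = to_xml(extract(new_paper, rules))
print(xml_text)
```

Keeping one rule set per journal mirrors the abstract's point that formats differ between journals. Batch extraction would then amount to applying the same `rules` to every paper in that journal's directory. An exact signature match is the simplest possible rule; a real system would likely need tolerance for formatting variation and text cues such as an "Abstract" prefix.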

Keywords/Search Tags:

Information Extraction, Traditional Paper, Structured, Feature Rules

Related items

1	Study On Feature Oriented Modeling Of Traditional Artistic Design
2	Technology For Domain-oriented Automatic Information Extraction From Semi-structured Web
3	Ontology-Based Structured Information Extraction From Web Pages
4	Information Extraction For Semi-structured Chinese Resume
5	Research On Methods Of Semi-structured Data Implication Rules Extraction
6	Research On Language And Key Techniques For Accurate Information Extraction Rules Towards Complex Web
7	Research On Feature Extraction Method Of Semi-structured Document
8	Research On Keyword Extraction And Structured List Data Extraction
9	Design And Implementation Of Feature Extraction System For Large-Scale Structured Data
10	Study Of The Literature Of Traditional Chinese Writing Paper