Research On Metadata Extraction Approach For PDF Document Papers

Posted on:2013-12-02

Degree:Master

Type:Thesis

Country:China

Candidate:H Z Liu

Full Text:PDF

GTID:2248330362962501

Subject:Computer software and theory

Abstract/Summary:

When building a digital resource library from Open Access (OA) journals on the Internet, fast, high-quality extraction of metadata from the journals' PDF documents is the key to generating the library automatically. Organizing and managing OA journals by metadata in the digital resource database can improve the accuracy and speed of paper retrieval. The automatic extraction of metadata is therefore an active research problem in digital resource library construction. Building on previous research, this thesis proposes two hybrid strategies for automatic metadata extraction, and analyzes and verifies their extraction performance.

First, to address the low extraction precision and weak adaptability of existing metadata extraction methods, we propose a hybrid approach to extracting metadata from PDF papers based on three statistical learning methods: HMM, SVM and CRF. We first convert the PDF documents to TXT format, select features suited to the characteristics of the TXT text and of each of the three statistical learning methods, and train and validate the three methods on a data set. From the validation results we then calculate each method's extraction precision for each kind of metadata, use the maximum rule to choose the extraction method for each kind of metadata, and so generate the hybrid extraction model. Finally, we use a time-period-based statistical method to update the hybrid extraction model dynamically so that it remains effective.

Second, building on this work, we use the sum rule to perform measurement-level fusion of the posterior probabilities that each statistical learning method produces when extracting paper metadata.
We first derive the sum rule from Bayesian decision theory, then apply it to fuse the posterior probabilities produced by the trained HMM, SVM and CRF models and so extract the paper metadata. We then update the three extraction models dynamically by setting a time period and a threshold on the number of documents.

Finally, using PDF papers collected online, we analyze and verify the performance of the two metadata extraction methods; we also set the time length according to the number of paper groups in order to evaluate the adaptability of the two hybrid extraction methods...
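
The two combination rules named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the precision figures and the posterior values are all hypothetical, and the sum rule is shown in its simplified form that assumes equal class priors, so the fused label is simply the one with the largest summed posterior.

```python
from typing import Dict, List


def select_by_maximum_rule(precision: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Maximum rule: for each metadata field, keep the method whose
    validation precision on that field is highest.

    `precision` maps method name -> {field name -> precision}.
    """
    fields = next(iter(precision.values())).keys()
    return {
        field: max(precision, key=lambda method: precision[method][field])
        for field in fields
    }


def fuse_by_sum_rule(posteriors: List[Dict[str, float]]) -> str:
    """Sum rule (equal-prior simplification): add the posterior
    probabilities from all models and return the label with the
    largest total.

    `posteriors` holds one {label -> P(label | x)} dict per model.
    """
    labels = posteriors[0].keys()
    totals = {label: sum(p[label] for p in posteriors) for label in labels}
    return max(totals, key=totals.get)


# Hypothetical validation precision per method and field (made-up numbers).
validation_precision = {
    "HMM": {"title": 0.90, "author": 0.80},
    "SVM": {"title": 0.95, "author": 0.85},
    "CRF": {"title": 0.93, "author": 0.91},
}
hybrid_model = select_by_maximum_rule(validation_precision)

# Hypothetical posteriors for one text line from the HMM, SVM and CRF models.
line_posteriors = [
    {"title": 0.6, "author": 0.3, "other": 0.1},  # HMM
    {"title": 0.3, "author": 0.5, "other": 0.2},  # SVM
    {"title": 0.2, "author": 0.6, "other": 0.2},  # CRF
]
fused_label = fuse_by_sum_rule(line_posteriors)
```

With these made-up inputs, the maximum rule assigns the title field to SVM and the author field to CRF, while the sum rule labels the line as an author line even though the HMM alone would have called it a title. The maximum rule picks one model per field once, at validation time; the sum rule instead combines all three models on every decision.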

Keywords/Search Tags:

Metadata extraction, Statistical learning, Maximum rules, Measure level fusion, Posterior probability, Addition rules

Related items

1	Financial Transaction Information Extraction System Based On Rules And Statistical Models
2	Mining Algorithm Research For Association Rules Base On Interest Measure
3	Research On Metadata Extraction Approach From Papers Based On Ensemble Learning
4	Research On Language And Key Techniques For Accurate Information Extraction Rules Towards Complex Web
5	Profit-Analyse Based Multi-level Association Rules Research
6	Research On The Optimization Of Association Rules
7	Tree Augmented Naive Bayes Classifier Based On Attributes Reduction Using Association Rules And Its Applications
8	Study On Approaches For Knowledge Uncertainty Measure And Rules Extraction Based On Rough Sets Theory
9	Research On Association Rules Mining Of Big Data
10	Research Of Attributes Reduction And Rules Extraction Of Decision Table Based On Granular Computing