Research On Metadata Extraction Approach For PDF Document Papers

Posted on: 2013-12-02
Degree: Master
Type: Thesis
Country: China
Candidate: H Z Liu
Full Text: PDF
GTID: 2248330362962501
Subject: Computer software and theory
Abstract/Summary:
When building a digital resource library that uses Open Access (OA) journals on the Internet as its information source, fast, high-quality extraction of metadata from the PDF documents in those journals is the key to generating the library automatically. Organizing and managing OA journals by metadata in the digital resource database can improve the accuracy and speed of paper retrieval. Automatic metadata extraction is therefore a current hot research problem in digital resource library construction. Building on previous research, this thesis proposes two hybrid strategies for automatic metadata extraction, and analyzes and verifies their extraction performance.

Firstly, to address the low extraction precision and weak adaptability of existing metadata extraction methods, we propose a hybrid approach to extracting metadata from PDF papers based on three statistical learning methods: HMM, SVM and CRF. We first convert each PDF to TXT format, select features suited to the characteristics of the TXT text and to each of the three statistical learning methods, and then train and validate the three methods on a data set. From the validation results we calculate the extraction precision of each method for each kind of metadata, use the maximum rule to choose the extraction method for each kind of metadata, and so generate the hybrid extraction model. Finally, we use a statistical method based on time periods to dynamically update the hybrid extraction model in order to ensure its effectiveness.

Secondly, building on this, we use the sum rule to realize measurement-level fusion of the posterior probabilities that each statistical learning method generates when extracting paper metadata. We first derive the sum rule from Bayesian decision theory, and then apply it to make a fusion decision over the posterior probabilities generated by the trained HMM, SVM and CRF models, so as to extract the metadata of papers. By setting a time period and a threshold on the number of documents, we dynamically update the three extraction models.

Finally, using PDF papers collected online, we analyze and verify the performance of the two metadata extraction methods; we also set the time length according to the number of paper groups in order to evaluate the adaptability of the two hybrid extraction methods...
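
The two combination rules named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the field labels and every number are hypothetical, and the sum rule is shown in its simplest form, which assumes equal class priors so that the decision reduces to picking the label with the largest summed posterior.

```python
from typing import Dict


def select_by_maximum_rule(precision: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each metadata field, pick the model with the highest validation precision."""
    return {
        field: max(scores, key=lambda model: scores[model])
        for field, scores in precision.items()
    }


def fuse_by_sum_rule(posteriors: Dict[str, Dict[str, float]]) -> str:
    """Sum each label's posterior over all models and return the label with the largest total.

    Assumes equal class priors, so the prior term of the general sum rule is dropped.
    """
    totals: Dict[str, float] = {}
    for model_posteriors in posteriors.values():
        for label, p in model_posteriors.items():
            totals[label] = totals.get(label, 0.0) + p
    return max(totals, key=lambda label: totals[label])


# Hypothetical validation precision per metadata field and model (illustrative values only).
precision = {
    "title": {"HMM": 0.90, "SVM": 0.95, "CRF": 0.93},
    "author": {"HMM": 0.88, "SVM": 0.85, "CRF": 0.91},
}
hybrid_model = select_by_maximum_rule(precision)

# Hypothetical posteriors that the three models assign to one line of text.
posteriors = {
    "HMM": {"title": 0.5, "author": 0.4, "other": 0.1},
    "SVM": {"title": 0.3, "author": 0.6, "other": 0.1},
    "CRF": {"title": 0.6, "author": 0.3, "other": 0.1},
}
fused_label = fuse_by_sum_rule(posteriors)

print(hybrid_model)
print(fused_label)
```

In this toy input the SVM alone would label the line as an author, while the summed posteriors favor the title label. The maximum rule commits each metadata field to a single model, whereas the sum rule lets all three models contribute to every decision.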
Keywords/Search Tags: Metadata extraction, Statistical learning, Maximum rules, Measure level fusion, Posterior probability, Addition rules