Design And Implementation Of Web Document Extraction And Offline Collection System

Posted on:2021-04-12

Degree:Master

Type:Thesis

Country:China

Candidate:C Ben

Full Text:PDF

GTID:2428330611998462

Subject:Software engineering

Abstract/Summary:

China Unicom's internal office platform handles a large number of documents, contracts, and project files. Managing them effectively means aggregating them, making them easy to archive and print, and providing content retrieval to cope with the rapid growth in document volume. To improve the efficiency of internal document management and optimize the internal office system, this project designs an offline collection system for Web documents based on Web text extraction, with the aim of proposing an effective method for the offline collection and management of Internet documents.

To support offline preservation and content retrieval of Web documents, the design and implementation proceed in three progressive stages. First, an effective Web text extraction algorithm is designed to extract the main text of a web page and remove content unrelated to the document. Second, a PDF file in a general format is generated from the extracted text, and, where a summary is required, multiple PDF documents are merged into a summary document for archiving and printing. Finally, a full-text index is built over the PDF document content to provide content retrieval and fast document search.

Correspondingly, the research covers the following three aspects:

1. Design and implementation of single-document offline collection. Single-document offline collection handles the offline preservation of a web document: downloading the web content, extracting the web text, and generating an offline PDF file in a general format from the extracted text.
2. Design and implementation of multi-document offline collection. Multi-document collection can be regarded as batch single-document processing followed by merging the resulting PDF files. Merging requires specifying the merge order, and, to make the result easier to read, bookmarks are generated at the same time. The work includes generating a description file for the multi-document merge, batch processing the documents according to that description file, and generating the merged PDF document.
3. Design and implementation of full-text retrieval over offline documents. Full-text retrieval finds relevant documents by their content. Its sub-tasks include discovering new documents in real time and building the full-text index incrementally, querying according to user input, and returning the list of matching documents.

This thesis implements a complete offline collection system for Web documents. The process covers requirements analysis, system design, system implementation, and system testing. The test results show that the design and implementation are basically usable. Applied to the management of the company's knowledge documents, the results of this project are expected to promote knowledge management and sharing within the company.
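
As an illustration of the first and third stages described above (text extraction and incrementally indexed full-text retrieval), the following is a minimal Python sketch rather than the thesis's actual method. The abstract does not specify the extraction algorithm or the indexing library, so everything here is an assumption made for illustration: the tag-skipping heuristic, the `SKIP_TAGS` set, the `TextExtractor` and `InvertedIndex` classes, and the whitespace-and-word tokenization are all hypothetical names and choices. PDF generation and merging are left out because they would require a third-party library.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: content inside these tags is unrelated to the document body.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collects visible text while ignoring content nested in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the page text with skipped regions removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


class InvertedIndex:
    """Hypothetical in-memory index mapping each term to document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.indexed = set()  # ids already indexed, for incremental updates

    def add_new(self, docs):
        """Index only documents not seen before; return the ids added."""
        added = []
        for doc_id, text in docs.items():
            if doc_id in self.indexed:
                continue
            for term in re.findall(r"\w+", text.lower()):
                self.postings[term].add(doc_id)
            self.indexed.add(doc_id)
            added.append(doc_id)
        return added

    def search(self, query):
        """Return sorted ids of documents containing every query term."""
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)
```

In this sketch, calling `add_new` again with a larger collection indexes only the documents that were not present before, which mirrors the incremental index building the abstract mentions. A production system would more likely rely on a dedicated search library and a content-density extraction method than on this simple heuristic.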

Keywords/Search Tags:

Web text extraction, PDF documents, Full-text retrieval, Off-line collection of documents

Related items

1	The Research On A Lucene-based Full-text Retrieval Model
2	The Research Of Full-text Search Engine Key Technology Based On Lucene
3	Research On Full-Text Retrieval Technology For XML Documents Based On Inverted Index
4	Research On Text Line Segmentation Method Of Tibetan Historical Documents Based On Rules And Learning
5	Printed Documents Source Identification Using Geometric Distortion On Text Lines
6	Noun phrases in documents: Preprocessing, automatic extraction, and statistical analysis in different categories of text
7	Application d'un reseau de neurones ARTMAP a la reconnaissance des commandes gestuelles d'edition de documents Braille (French text)
8	Methodology Of Full-text Retrieval And Identification Against Illegally Revealed Sensitive Files
9	Research On Distributed Full-Text Index System
10	Handwritten Text Detection In Natural Scenes And Historical Documents