| China Unicom's internal office system platform handles a large number of documents, contracts, and project files. Managing them effectively requires that documents be easy to aggregate, archive, and print, and that their content be searchable, so that the system can cope with a rapidly growing number of documents. To improve the efficiency of internal document management and optimize the internal office system, this project designs an offline collection system for Web documents based on Web text extraction, with the aim of providing an effective method for collecting and managing Internet documents offline. To support offline preservation and content retrieval of Web documents, the design and implementation proceed in three progressive stages. First, an effective Web text extraction algorithm is designed to extract the main text of a web page and remove content unrelated to the document. Second, a PDF file in a general format is generated from the extracted text and, where a summary is required, multiple PDF documents are merged into a summary document for archiving and printing. Finally, a full-text index is built over the PDF content to provide content retrieval and fast document search. Correspondingly, the research covers the following three aspects. 1. Design and implementation of single-document offline collection. Single-document offline collection preserves a Web document offline; it includes downloading the web content, extracting the web text, and generating an offline PDF file in a general format from the extracted text. 2. Design and implementation of multi-document offline collection. Multi-document collection can be regarded as batch single-document processing followed by merging the resulting PDF files. Merging requires a specified order, and bookmarks are generated at the same time to make the merged document easier to read.
The sub-tasks include generating a description file for the multi-document merge, batch processing the documents according to that description file, and generating the merged PDF document. 3. Design and implementation of full-text retrieval of offline documents. Full-text retrieval searches for related documents according to their content. The sub-tasks include discovering new documents in real time, building the full-text index incrementally, querying the index according to user input, and returning the list of related documents retrieved. This paper implements a complete offline collection system for Web documents. The process covers requirement analysis, system design, system implementation, and system testing. The test results show that the system as designed and implemented is basically usable. Applying the results of this project to the management of the company's knowledge documents is expected to promote knowledge management and sharing within the company. |