| With the advent of the Internet age, the way people transmit information has shifted from paper materials to electronic files. Thanks to its cross-platform support, small file size, and stable typesetting, the PDF format has held the top spot among electronic file formats since its launch. PDF files can be divided into digital PDF files and image PDF files. Digital PDF files are converted from Word, Excel, and other files, while image PDF files are obtained by scanning paper documents with mobile phones or scanners. Because of the way image PDF files are generated, users face several obstacles when working with them. First, the content is essentially a picture, so users cannot copy or edit the text in the file. Second, because of the camera angle or uneven lighting when shooting with a mobile phone, the generated PDF file often contains large shadows, which degrades the user’s reading experience. Finally, the captured paper documents often have distortions such as folds and wrinkles, which make content extraction more difficult. At present, almost all of the latest image processing techniques target natural scenes, and there is little research on document images; in addition, the existing PDF processing software on the market does not offer shadow removal. Therefore, starting from user needs and practical social value, this work implements a PDF content extraction system based on deep learning and other technologies. The main work includes the following aspects: 1. Based on the research background of the subject and the current state of research at home and abroad, the research direction and goals of the system are clarified. The technologies for the system are selected, the functional modules are divided by analyzing the system requirements, and the detailed design of the system is then given. 2. The BEDSR-Net model and related technologies are used to implement shadow removal for PDF file content. This model is the first and only deep learning
model specially designed for document image shadow removal, and it overcomes the drawbacks of traditional methods that remove image shadows using heuristic, manually designed features. 3. A method for building a document text image dataset is proposed for training the text detection and text recognition models. PDF files published on the Internet are crawled, and pages are randomly selected from them. Since these PDF files are digitally generated, the relevant libraries are used to convert the pages into images, and the images are then cropped according to the text line coordinates to obtain high-quality text line images. Image augmentation is then applied to these data to improve the generalization ability of the models. 4. A tool library that operates on document objects is used to extract text and tables from digital PDF files, and deep-learning-based OCR is used to extract text and tables from image PDF files. Depending on the user’s needs, the system can process only a selected page range and generate output files in different formats. The detailed processing flow chart is given and the extraction results are shown. 5. The front-end pages of the system are implemented, which lowers the barrier to use and lets users operate the system easily. For practicality, some common additional functions have been added, such as PDF merging, PDF splitting, PDF rotation, and PDF watermarking. Finally, the system’s interface, functions, and performance are tested, and the test results are given. |
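
The dataset construction in item 3 (render a digital PDF page to an image, then crop it by text line coordinates) depends on mapping boxes given in PDF points (1/72 inch) to pixel boxes at the rendering resolution. The following is a minimal sketch of that mapping only, not the system's actual implementation. It assumes a top-left origin for both coordinate systems, and the `TextLine` record, the `to_pixel_box` and `crop` helpers, and the `pad` parameter are hypothetical names introduced for illustration; the abstract does not state which library or resolution is used.

```python
import math
from dataclasses import dataclass

POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


@dataclass(frozen=True)
class TextLine:
    """Hypothetical record: a text line's box in PDF points (top-left origin assumed)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def to_pixel_box(line, dpi, page_w_px, page_h_px, pad=2):
    """Map a text line box from points to a pixel box (left, top, right, bottom).

    The box is rounded outward, padded by `pad` pixels, and clamped to the page
    so that the crop never falls outside the rendered image.
    """
    scale = dpi / POINTS_PER_INCH
    left = max(0, math.floor(line.x0 * scale) - pad)
    top = max(0, math.floor(line.y0 * scale) - pad)
    right = min(page_w_px, math.ceil(line.x1 * scale) + pad)
    bottom = min(page_h_px, math.ceil(line.y1 * scale) + pad)
    return left, top, right, bottom


def crop(image, box):
    """Crop a row-major image (a list of pixel rows) to the given pixel box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


if __name__ == "__main__":
    # A blank 400x200 "page" rendered at 144 DPI, used as a stand-in image.
    page = [[255] * 400 for _ in range(200)]
    line = TextLine(72, 36, 144, 48, "example text line")
    box = to_pixel_box(line, dpi=144, page_w_px=400, page_h_px=200)
    patch = crop(page, box)
    print(box, len(patch), len(patch[0]))
```

In practice the pixel box would be passed to an image library's crop routine, and the line's text would be stored alongside the patch as its recognition label. The small padding is a design choice intended to keep ascenders and descenders from being clipped.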