| PDF is a commonly used cross-platform document storage format, widely used for publishing electronic documents and disseminating digital information. With the increasing popularity of PDF, extracting its structural information has become a research hotspot, since that information is an important data source for text extraction, machine learning, PDF reconstruction, and other applications. However, as a fixed-layout format, PDF does not provide structural information directly, so structure analysis methods are needed to process PDF files. This paper takes the PDF structure of academic papers as its research object and proposes a complete structure analysis scheme for the full-text structure of PDF documents. The main work is as follows: 1. By analyzing the structural characteristics of PDF documents, this paper divides the full-text structure of a PDF into three structural elements: text, image, and table. According to the characteristics of each element, it proposes a machine learning method based on the Mask R-CNN model to analyze the structure of PDF documents. The method builds data sets for the three types of PDF structure and trains the model iteratively through model construction and parameter optimization, thereby completing the analysis and extraction of PDF structure information. 2. Because the structure information produced directly by the machine learning method may be inaccurate, or may fail to be parsed in some cases, this paper further proposes using crowdsourcing for custom PDF structure parsing. It designs and implements a crowdsourcing system for custom PDF structure parsing, and proposes a structure self-adjusting algorithm and a structure block sorting algorithm. Manual error correction and manual adjustment are carried out through crowdsourcing, which compensates for the deficiencies of the machine learning method in parsing PDF structure. 3. Because the crowdsourcing model produces a large amount of crowdsourced data, this paper designs and implements a majority voting scheme, proposing a voting algorithm based on page units, an algorithm based on structure, and an IOU-structure block algorithm. The voting scheme selects the custom parsing results that most users agree on for the same PDF page, in order to obtain more accurate PDF structure analysis data. Experiments are designed for each of the above schemes. The results show that the machine learning method proposed in this paper can effectively complete the structure analysis of PDF documents, and that more accurate PDF structure information can be obtained through crowdsourcing and the voting strategy. The proposed analysis scheme helps further mine the information in PDF academic papers and opens up a new approach to PDF structure analysis. |
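
The abstract names a majority voting scheme with an IOU-based comparison of structure blocks but gives no details. The following is only a minimal sketch of how such a vote might work, not the paper's algorithm. It assumes each user's annotation of a page is a list of axis-aligned boxes `(x_min, y_min, x_max, y_max)`, that two boxes refer to the same block when their IOU reaches a threshold, and that a block is kept when more than half of the users marked it. The names `iou`, `majority_blocks`, and `iou_threshold`, and the threshold value 0.5, are hypothetical.

```python
from typing import List, Tuple

# Assumed representation: (x_min, y_min, x_max, y_max) in page coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def majority_blocks(annotations: List[List[Box]],
                    iou_threshold: float = 0.5) -> List[Box]:
    """Keep the blocks that more than half of the users agree on.

    annotations[i] holds the boxes drawn by user i for one page.
    """
    n_users = len(annotations)
    clusters: List[List[Tuple[int, Box]]] = []  # groups of (user, box)

    for user, boxes in enumerate(annotations):
        for box in boxes:
            for cluster in clusters:
                representative = cluster[0][1]
                already_voted = any(u == user for u, _ in cluster)
                # Greedy grouping: join the first cluster whose first box
                # overlaps enough, at most one vote per user per cluster.
                if not already_voted and iou(representative, box) >= iou_threshold:
                    cluster.append((user, box))
                    break
            else:
                clusters.append([(user, box)])

    result: List[Box] = []
    for cluster in clusters:
        if len(cluster) * 2 > n_users:  # strict majority of users
            k = len(cluster)
            # Average the agreeing boxes coordinate by coordinate.
            merged = tuple(sum(b[i] for _, b in cluster) / k for i in range(4))
            result.append(merged)
    return result


page_votes = [
    [(0, 0, 10, 10), (20, 20, 30, 30)],  # user 0
    [(1, 0, 10, 10)],                    # user 1
    [(0, 0, 10, 11), (50, 50, 60, 60)],  # user 2
]
consensus = majority_blocks(page_votes)
print(consensus)
```

In this example only the first block is marked by all three users, so it is the only one kept. The blocks at `(20, 20, 30, 30)` and `(50, 50, 60, 60)` each have a single vote and are dropped. Greedy grouping against the first box of each cluster is a simplification chosen for brevity, and the result can depend on the order of the annotations.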