| With the growing popularity of the Internet and the increasing demand for office automation, electronic documents are used ever more widely. PDF (Portable Document Format), a file format for electronic documents that is independent of platform, has become an important format for transmitting and storing digital information. As PDF has spread, attacks against it have become more and more common. The most serious harm is caused by PDF documents carrying malicious code, which have brought great losses to a vast number of enterprises and users. As a result, technology for detecting malicious PDF documents is becoming more and more important. This paper studies the structure of PDF documents and the attack techniques that target them. After comparing malicious PDF documents and the existing detection methods, and drawing on current PDF document classification methods, it puts forward a static detection method for malicious PDF documents based on logistic regression. The specific work is as follows: 1. This paper summarizes the structural characteristics of PDF and analyzes the advantages and disadvantages of existing PDF detection techniques. Building on logistic regression, I put forward a detection technique for malicious PDF documents. 2. I design and implement a PDF document detection system; the system requirements, the system design, and the function and implementation of the key modules are described in detail. 3. In the PDF document feature extraction module, the structural paths of a PDF document are selected as features, in accordance with the PDF document format, and are extracted from each document. The extraction process uses a breadth-first algorithm to ensure the effectiveness of the system. 4. I study popular feature extraction algorithms and choose the chi-square test as the system's feature selection algorithm, selecting the important features that the system uses for analysis. 5. In the PDF classification and detection module, I use the logistic regression algorithm, which is widely used in the field of machine learning. Simulation experiments verify the validity of the system in terms of test accuracy and time efficiency. |
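
The three stages named in the abstract (breadth-first extraction of structural paths, chi-square feature selection, and logistic regression classification) can be illustrated with a minimal sketch. It is not the system described in this paper: the nested dictionaries are toy stand-ins for parsed PDF object trees, the function names `structural_paths`, `chi_square`, `train_logistic`, and `predict` are hypothetical, and the learning rate, epoch count, and number of selected features are arbitrary assumptions.

```python
import math
from collections import deque


def structural_paths(root):
    """Collect key paths from a nested dict, visiting nodes breadth-first."""
    paths, seen = set(), set()
    queue = deque([("", root)])
    while queue:
        prefix, node = queue.popleft()
        # Skip leaves, and skip objects already expanded (object graphs may contain cycles).
        if not isinstance(node, dict) or id(node) in seen:
            continue
        seen.add(id(node))
        for key, child in node.items():
            path = f"{prefix}/{key}"
            paths.add(path)
            queue.append((path, child))
    return paths


def chi_square(doc_paths, labels, feature):
    """Chi-square statistic of the 2x2 table: feature present/absent vs. label 1/0."""
    a = b = c = d = 0
    for paths, label in zip(doc_paths, labels):
        present = feature in paths
        if present and label == 1:
            a += 1
        elif present:
            b += 1
        elif label == 1:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def predict(weights, bias, x):
    """Estimated probability that feature vector x belongs to class 1 (malicious)."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=200):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict(weights, bias, xi) - yi
            weights = [w - lr * err * v for w, v in zip(weights, xi)]
            bias -= lr * err
    return weights, bias


# Toy, hand-written object trees (assumed data): label 0 = benign, 1 = malicious.
documents = [
    {"Root": {"Pages": {"Kids": {"Contents": {}}}}},
    {"Root": {"Pages": {"Kids": {"Contents": {}}}, "Metadata": {}}},
    {"Root": {"OpenAction": {"JS": {}}, "Pages": {"Kids": {"Contents": {}}}}},
    {"Root": {"OpenAction": {"JS": {}}, "Pages": {"Kids": {}}}},
]
labels = [0, 0, 1, 1]

doc_paths = [structural_paths(doc) for doc in documents]
vocabulary = sorted(set().union(*doc_paths))
# Rank paths by chi-square score (ties broken by name) and keep the top two.
ranked = sorted(vocabulary, key=lambda f: (-chi_square(doc_paths, labels, f), f))
selected = ranked[:2]

# Binary feature vectors: 1 if the document contains the selected path, else 0.
X = [[1 if f in paths else 0 for f in selected] for paths in doc_paths]
weights, bias = train_logistic(X, labels)
scores = [predict(weights, bias, x) for x in X]
print(selected, [round(s, 3) for s in scores])
```

In this sketch each document is reduced to the set of paths it contains, so the chi-square test only needs counts of presence and absence per class. A real system would have to parse actual PDF objects and would likely use a library implementation of the selector and classifier instead of these hand-written ones.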