Phishing is a social engineering attack that illegally obtains victims’ private information through deceptive emails or fake websites that closely resemble legitimate ones, inducing users to enter sensitive information through forged links. RSA reports that phishing incidents cost international organizations up to $9 billion in indirect economic losses worldwide. As detection technology evolves, phishing websites use domain name spoofing, code obfuscation, and other methods to bypass existing detection methods. As a result, some features have become outdated: they contribute little to phishing website detection and can even cause false positives. Some existing feature selection frameworks can rank features effectively by certain metrics, but they cannot produce a reduced feature subset suitable for real-time monitoring of phishing sites. How to automatically generate an efficient feature subset, remove outdated features, and improve the efficiency and stability of phishing detection has therefore become an urgent problem. This thesis addresses the problem with the following results.

Common feature selection methods can only rank features by some metric, so a cutoff position must be chosen before the feature set can be reduced. Previous studies mostly set this threshold manually and could not automatically obtain an efficient, reduced feature subset based on the model's actual performance, such as accuracy and running time. This thesis proposes a hybrid feature selection framework to solve this problem. It includes four stages: data preprocessing, feature extraction, feature selection, and model construction. The data preprocessing stage reduces interference from irrelevant information by removing outdated websites and preset non-data characters. The feature extraction stage uses a regular-expression-based method to extract
preset 36-dimensional features from web page links and content. The feature selection stage uses a hybrid framework that combines data perturbation and function perturbation, and adds an automatic cutoff position generation algorithm to obtain the optimal feature subset. In the model construction stage, comparative experiments across a variety of machine learning models lead to the choice of a decision tree model, which is trained to perform binary classification.

Based on the proposed feature selection framework, this thesis designs and implements a phishing website detection system. It consists of a data preprocessing module, an attack detection module, and a response storage module. Once deployed, the system automatically crawls the web pages submitted by users, quickly extracts features, and accurately detects phishing websites. If a phishing website is identified, the detection result is reported to the user. Compared with using all preset features, the system reduces detection time by 43.6% while accuracy remains essentially unchanged, which better meets the requirements of real-time monitoring.
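
The feature selection stage described above can be illustrated with a minimal sketch. It is not the thesis implementation: the two ranking functions, the nearest-centroid evaluator (a stand-in for the decision tree), the `tolerance` rule for the cutoff, and the toy data are all assumptions chosen to keep the example self-contained. Data perturbation is modeled as bootstrap resampling, function perturbation as applying several rankers, and the automatic cutoff as the shortest ranked prefix whose accuracy stays within a tolerance of the best prefix. Running time, which the framework also considers, is left out here.

```python
import random
import statistics

def pearson_score(column, labels):
    """Ranker 1 (assumed): absolute Pearson correlation with the label."""
    mx, my = statistics.fmean(column), statistics.fmean(labels)
    cov = sum((x - mx) * (y - my) for x, y in zip(column, labels))
    var_x = sum((x - mx) ** 2 for x in column)
    var_y = sum((y - my) ** 2 for y in labels)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5

def mean_gap_score(column, labels):
    """Ranker 2 (assumed): absolute gap between the two class means."""
    pos = [x for x, y in zip(column, labels) if y == 1]
    neg = [x for x, y in zip(column, labels) if y == 0]
    if not pos or not neg:
        return 0.0
    return abs(statistics.fmean(pos) - statistics.fmean(neg))

def aggregate_ranking(X, y, rankers, n_bootstraps=20, seed=0):
    """Sum each feature's rank over bootstrap samples (data perturbation)
    and over several rankers (function perturbation); best feature first."""
    rng = random.Random(seed)
    n_rows, n_features = len(X), len(X[0])
    rank_sums = [0] * n_features
    for _ in range(n_bootstraps):
        idx = [rng.randrange(n_rows) for _ in range(n_rows)]
        ys = [y[i] for i in idx]
        for ranker in rankers:
            scores = [ranker([X[i][j] for i in idx], ys)
                      for j in range(n_features)]
            order = sorted(range(n_features), key=lambda j: -scores[j])
            for rank, j in enumerate(order):
                rank_sums[j] += rank
    return sorted(range(n_features), key=lambda j: rank_sums[j])

def centroid_accuracy(X, y, features):
    """Training accuracy of a nearest-centroid classifier restricted to
    `features`; a simple stand-in for the model used in the thesis."""
    centroids = {}
    for label in (0, 1):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [statistics.fmean(r[j] for r in rows)
                            for j in features]
    correct = 0
    for row, lab in zip(X, y):
        dist = {c: sum((row[j] - m) ** 2 for j, m in zip(features, cen))
                for c, cen in centroids.items()}
        correct += min(dist, key=dist.get) == lab
    return correct / len(X)

def automatic_cutoff(X, y, ranking, tolerance=0.01):
    """Shortest ranked prefix whose accuracy is within `tolerance`
    of the best accuracy over all prefixes (hypothetical rule)."""
    accs = [centroid_accuracy(X, y, ranking[:k])
            for k in range(1, len(ranking) + 1)]
    best = max(accs)
    k = next(k for k, a in enumerate(accs, start=1) if a >= best - tolerance)
    return ranking[:k]

# Toy data: feature 0 equals the label, feature 1 is constant,
# feature 2 is a pattern unrelated to the label.
y = [i % 2 for i in range(40)]
X = [[float(y[i]), 0.5, float((i // 2) % 2)] for i in range(40)]

ranking = aggregate_ranking(X, y, [pearson_score, mean_gap_score])
selected = automatic_cutoff(X, y, ranking)
print("ranking:", ranking, "selected:", selected)
```

On this toy data the perfectly informative feature is ranked first and a single feature already reaches the best accuracy, so the cutoff keeps only that feature. In a real setting the evaluator would be the trained classifier on held-out data, and the cutoff rule could also penalize per-feature extraction time.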