| The popularity of the Internet has gradually moved many traditional activities online. At the same time, network security problems are becoming increasingly serious. Criminals use the Internet for illegal activities, employing phishing websites, domain name hijacking, Trojan horses, and other means to extort users or steal their personal information, causing users huge economic losses. In recent years, many botnets have used domain name transformation techniques to evade detection and blocking. These techniques typically generate large numbers of domain names through a domain generation algorithm (DGA), which makes it very difficult for security personnel to detect DGA domain names. To address these problems, this paper summarizes and experiments on the collected data, draws on previous work, and extracts features that differ significantly between DGA domain names and normal domain names. Based on the extracted features, the support vector machine (SVM) algorithm is selected to classify DGA and normal domain names, and its classification performance is analyzed and verified experimentally. The paper first analyzes the character composition and distribution characteristics of DGA domain names and normal domain names. The unsupervised K-means algorithm is then used to cluster DGA-generated domain names and normal domain names. Before clustering, each domain name is processed with the N-gram method, and the resulting N-gram representations are analyzed. The clustering results show that DGA domain names and normal domain names can be effectively distinguished by the character composition and distribution
characteristics of the domain name. This provides a feasible basis for classifying DGA domain names and normal domain names according to the composition and distribution characteristics of domain name strings. Second, in view of the differences in character composition and distribution between DGA domain names and normal domain names, the following five features are extracted: (1) domain name length; (2) domain name entropy; (3) vowel ratio; (4) consecutive-consonant ratio; (5) digit ratio. These five features can distinguish DGA domain names from normal domain names. A classification model is trained with the SVM algorithm, and its classification performance is tested and verified, with good results. A decision tree classifier is also trained and evaluated on the same five features; the experiments show that SVM classifies better than the decision tree. The optimized SVM model obtained by training achieves good results in distinguishing DGA domain names from normal domain names. However, the SVM model trained on the five features performs poorly on short DGA domain names. To address this, hidden Markov features are added and the SVM model is retrained; the resulting optimal model is then tested and verified. The experiments show that the model with hidden Markov features achieves better classification results on both long and short DGA domain names. Specifically, the recognition accuracy for DGA domain names in each category is above 93.4%, the recognition accuracy for normal domain names is above 86.4%, the recall is above 89.8%, and the accuracy is above 89.9%. Compared with FluxBuster and
DGA domain name recognition tools, experiments show that the proposed model performs slightly better than the former two. |
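
The five character-level features listed in the abstract can be illustrated with a short sketch. The abstract does not give exact definitions, so everything below is an assumption made for illustration: entropy is taken as the Shannon entropy of the character distribution, the consecutive-consonant ratio as the fraction of characters lying in a run of two or more consonants, and the function names (`shannon_entropy`, `consecutive_consonant_ratio`, `extract_features`) are hypothetical. The input is assumed to be a single domain label with the top-level domain already removed.

```python
import math
import string
from collections import Counter

VOWELS = set("aeiou")
CONSONANTS = set(string.ascii_lowercase) - VOWELS


def shannon_entropy(s):
    """Shannon entropy, in bits per character, of the characters in s (assumed definition)."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def consecutive_consonant_ratio(s):
    """Fraction of characters that lie in a run of two or more consonants (assumed definition)."""
    if not s:
        return 0.0
    in_runs = 0
    run = 0
    for ch in s + " ":  # the trailing space is a sentinel that flushes the last run
        if ch in CONSONANTS:
            run += 1
        else:
            if run >= 2:
                in_runs += run
            run = 0
    return in_runs / len(s)


def extract_features(label):
    """Return the five features for one domain label (hypothetical helper)."""
    s = label.lower()
    n = len(s)
    if n == 0:
        return {"length": 0, "entropy": 0.0, "vowel_ratio": 0.0,
                "consecutive_consonant_ratio": 0.0, "digit_ratio": 0.0}
    return {
        "length": n,
        "entropy": shannon_entropy(s),
        "vowel_ratio": sum(ch in VOWELS for ch in s) / n,
        "consecutive_consonant_ratio": consecutive_consonant_ratio(s),
        "digit_ratio": sum(ch.isdigit() for ch in s) / n,
    }
```

Calling `extract_features("google")` and `extract_features("xk7qzt9b")` yields two feature dictionaries that could be fed as vectors to a classifier such as an SVM. A random-looking label of this kind tends to show higher entropy, a lower vowel ratio, and a higher digit ratio than a dictionary-like one.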