| Intrusion detection, as a proactive security measure, has become an active research area in information security, especially network security, and a useful complement to antivirus software and firewalls. Existing intrusion detection methods fall into two categories: anomaly detection and misuse detection. Misuse detection identifies attacks by comparing the behavior of a computer with the characteristics of known attacks. Although its detection rate is higher, it cannot detect unknown attacks. Anomaly detection fills this gap: given a model of the system's normal behavior, it flags an intrusion when the observed behavior deviates significantly from that model. It is therefore a central focus of intrusion detection research. Anomaly detection commonly relies on machine learning or data mining methods. Among machine learning methods, the support vector machine (SVM) offers outstanding performance, with high classification accuracy and good generalization ability, and is therefore widely used in intrusion detection. However, SVM has two main deficiencies. First, the detection model depends on learning from sample data, and it is currently very difficult to assemble a complete sample set. Second, the sample data must be classified and labeled manually, which is a very large amount of work, and no reasonable automatic labeling method is available. Building a real-time SVM-based anomaly detection system that works in a practical environment is therefore difficult. Unsupervised systems generally use clustering. Clustering does not require clean training data and runs quickly, but its results are easily influenced by its parameters, which are hard to choose, and it is hard to determine whether the smallest cluster consists of outliers, although in most cases it does.
Therefore, these clustering-based methods suffer from a lower detection rate and a higher false alarm rate. Based on the above, the key question of this paper is how to combine SVM and clustering effectively, so as to take full advantage of the speed of clustering and the precision of the support vector machine and achieve online anomaly intrusion detection. To address these problems, the main achievements of this paper are as follows: 1. We propose a new unsupervised intrusion detection algorithm based on clustering and the support vector machine, and we have implemented its core. Compared with traditional intrusion detection methods, the algorithm is fast and accurate, and its data do not need to be labeled in advance. The required data set is therefore easier to collect, which is a clear advantage. 2. The algorithm consists of three stages: data pre-processing, training, and testing. In the implemented system, the training data first enter the Data Preprocessing Subsystem, which reduces the dimensionality of the data, quantifies discrete attributes, and normalizes the data. Second, the normalized data are clustered and divided into two categories, "normal" and "abnormal". Next, the automatically labeled data are used to train an SVM, which forms the target system's normal behavior model. Finally, the system classifies the network data (the test set) and determines whether each record is an intrusion. The data pre-processing stage standardizes the network packets and prepares them for the training and testing stages. It consists of three parts: intrusion feature selection, quantification of discrete attributes, and normalization of continuous attributes. This paper uses the following method for feature selection: one input feature is deleted from the data at a time, and the resulting data set is then used to train and test the classifier.
Then the classifier's performance is compared with that of the original classifier on the relevant performance criteria. Experiments show that this method is feasible and effective. In the second stage, the target system's normal behavior model is formed by training on the training set. The training algorithm is an unsupervised algorithm for intrusion detection that combines a clustering algorithm with the support vector machine; we cluster the data with a hierarchical clustering algorithm. Because clustering is unsupervised, we use it to label the training set. The standard SVM algorithm (C-SVM) is supervised and requires labeled data. The two are thus complementary: clustering supplies the labels that SVM needs. We can therefore combine SVM and clustering naturally, taking advantage of the efficiency of clustering and the high precision of SVM. The testing stage uses the model formed in the training stage to classify new data. This stage includes only two subsystems, data pre-processing and SVM prediction, and does not include the clustering subsystem. The algorithm first selects features according to the results of intrusion feature selection in the training phase, then pre-processes the data to meet the required format, and finally uses the model generated in the training phase to classify the new data and determine whether each record is an intrusion. 3. The experimental results verify the correctness and validity of this method. The data used in our experiments come from the KDD99 data set. To simulate a real network environment and satisfy the clustering assumption, we keep the proportion of attacks below 5% when selecting the training and test data. After selecting the optimal features and parameters for the algorithm, we test and evaluate it in two respects: its ability to detect known intrusions and its ability to detect unknown intrusions.
We compare the algorithm with other algorithms in five respects: overall accuracy, accuracy on normal data, accuracy on abnormal data, training time, and testing time. Extensive experiments show that combining clustering and support vector machines for intrusion detection, especially anomaly detection, is correct and effective. The contributions of this paper are as follows: 1) The method takes full advantage of the efficiency of clustering and the high accuracy of SVM, so it both improves detection speed and achieves higher detection accuracy. 2) The algorithm eliminates insignificant or useless features of the network data to focus on the most important ones, which improves its speed. 3) The algorithm trains directly on the raw data obtained by the intrusion detection system and does not require the training set to be labeled or strictly filtered, which greatly improves the effectiveness and practicality of the detection algorithm. 4) It can analyze large amounts of historical data directly without strict filtering, so it can be applied in a real system without modification. |
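
The pre-processing and automatic labeling steps summarized above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation, and everything specific in it is an assumption: the integer coding of discrete attributes, min-max normalization, average-linkage clustering, the rule "largest cluster is normal" (motivated by the stated attack proportion below 5%), the function names, and the toy records, which only imitate KDD99-style fields.

```python
from itertools import combinations

def quantify(records, discrete_cols):
    """Replace each discrete attribute by an integer code (assumed scheme)."""
    codes = {c: {} for c in discrete_cols}
    rows = []
    for rec in records:
        row = []
        for i, v in enumerate(rec):
            if i in codes:
                # First-seen order defines the code of each category.
                v = codes[i].setdefault(v, len(codes[i]))
            row.append(float(v))
        rows.append(row)
    return rows

def normalize(rows):
    """Min-max scale every column to [0, 1]; constant columns become 0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def hierarchical_clusters(rows, k):
    """Naive average-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(rows))]

    def linkage(c1, c2):
        total = sum(dist(rows[i], rows[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Merge the closest pair; a < b, so deleting b keeps index a valid.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

def label_by_cluster(rows, k=2):
    """Label the largest cluster 'normal' and everything else 'abnormal'."""
    largest = max(hierarchical_clusters(rows, k), key=len)
    labels = ["abnormal"] * len(rows)
    for i in largest:
        labels[i] = "normal"
    return labels

# Hypothetical toy records: (duration, protocol, bytes).
records = [
    (0, "tcp", 200), (1, "tcp", 220), (0, "tcp", 210), (1, "tcp", 190),
    (0, "udp", 205), (1, "tcp", 215), (90, "icmp", 9000),
]
labels = label_by_cluster(normalize(quantify(records, {1})))
print(labels)
```

In a full pipeline, the normalized rows and these automatically generated labels would then be handed to a supervised C-SVM trainer (for example `sklearn.svm.SVC`, if scikit-learn is available), and only pre-processing plus SVM prediction would run at test time. The naive clustering here is cubic in the number of records and is meant for illustration only.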