| With the further advancement of network globalization, network attacks have become more frequent. Accurately detecting attack behavior from network traffic data is a key technical problem that urgently needs to be solved. When combined with prior knowledge from the field of network security, general-purpose machine learning algorithms become domain-specific and can detect attacks accurately and efficiently, and this combination is becoming a research hotspot in network traffic anomaly detection. However, in the current complex network environment, network traffic data is generally noisy and of mixed types, which poses severe challenges to anomaly detection algorithms. First, network traffic data contains a large number of noise features, which conceal the expression of outliers and seriously reduce the accuracy of anomaly detection. Second, network traffic data contains both categorical and numerical features; the data types are complex, and there is a natural barrier between the different feature spaces, so they cannot be handled uniformly. In response to these problems, this paper focuses on anomaly detection for network traffic data, conducts in-depth research on anomaly detection algorithms based on feature selection and on reinforcement learning, and implements a prototype network traffic anomaly detection system on a distributed stream processing platform. The main research contributions are as follows. Network traffic data tends to be high-dimensional and to contain many noise features, which can conceal real anomalies and degrade the performance of anomaly detection. Existing feature selection algorithms are usually suited to purely categorical or purely numerical data and cannot be applied directly to network traffic data that contains both categorical and numerical features. This paper proposes a novel outlier detection method based on feature selection in
network traffic data, termed ODNTD, which adopts domain knowledge in feature selection. ODNTD consists of a three-stage decomposition-aggregation-decomposition process. In the first decomposition stage, anomalies are scored separately in the categorical feature space, by feature value frequency, and in the numerical feature space, by deep autoencoders. In the aggregation stage, a dynamic aggregation strategy combines the two types of anomaly scores, and a threshold function is then applied to obtain the anomaly candidate set. In the second decomposition stage, domain knowledge and machine learning algorithms are combined to perform coarse-grained screening and fine-grained selection on the anomaly candidate set in each feature space to obtain a feature subset. These stages are repeated until the empirical error no longer decreases. ODNTD iteratively exchanges information between the categorical feature space and the numerical feature space and embeds domain knowledge into the feature selection algorithm. Experiments on real network traffic data show that, compared with existing anomaly detection algorithms, ODNTD increases AUC_ROC by 46.2% on average and reduces the number of features by 52.0% on average. In network traffic anomaly detection, some domain knowledge can be summarized from the characteristics of known anomalies. Existing algorithms usually simply superimpose domain knowledge on machine learning and use fixed anomaly thresholds in the machine learning step. Because these thresholds cannot be adjusted dynamically to the actual scenario, domain knowledge and machine learning are not fully integrated, the detected anomalies are inaccurate, and performance is poor. Because reinforcement learning can optimize decisions in complex scenarios to obtain optimal parameters, this paper proposes a novel Anomaly Detection method based on Reinforcement Learning, termed ADRL, which uses reinforcement learning to
dynamically search for thresholds and accurately obtain anomaly candidate sets, so that domain knowledge and machine learning are fully fused and reinforce each other. ADRL uses prior domain knowledge to label known anomalies. Using this known anomaly information, it scores anomalies in the categorical feature space with information entropy and in the numerical feature space with deep autoencoders, and then dynamically aggregates the anomaly scores of the two spaces. To obtain an accurate anomaly candidate set, ADRL uses reinforcement learning to search for the best threshold: it initializes the anomaly threshold, obtains an initial anomaly candidate set, mines frequent rules from the candidate set to form new knowledge, and uses that knowledge to correct the anomaly scores. Depending on how the anomaly scores are corrected, different threshold modification strategies are applied until the best threshold and its corresponding anomaly candidate set are obtained, and the machine learning model is then updated with that candidate set. This process is repeated until the anomaly candidate set is stable. Experiments on real network traffic data show that, compared with existing anomaly detection algorithms, ADRL increases AUC_ROC by 89.6% on average and AUC_PR by 286.0% on average. To further verify the research results of this paper, a Human-in-the-Loop Outlier Detection System, termed HLODS, was designed and implemented on the distributed stream processing platform Storm for anomaly detection. HLODS first takes the features selected by the ODNTD algorithm as its basis and converts network traffic into a stream of fixed-length feature vectors through feature extraction. Because the SVM algorithm generalizes well, can solve high-dimensional nonlinear problems, and is easy to learn incrementally, the HLODS
system uses the unsupervised algorithm ADRL and the supervised algorithm SVM as its base algorithms and uses known anomalous data to pre-train the SVM. Anomalies in the fixed-length feature vector stream are detected by the ADRL algorithm and provided to experts for manual labeling. The manually labeled data is then used to update the SVM model, the updated model is used for anomaly detection, and finally the ADRL results are integrated with the SVM results so that the two work collaboratively. Experiments on real network traffic data show that, compared with existing unsupervised anomaly detection algorithms, HLODS increases AUC_ROC by 68.4% on average and AUC_PR by 3763.1% on average; compared with existing supervised anomaly detection algorithms, it increases AUC_ROC by 33.9% on average and AUC_PR by 107.3% on average. |
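
The scoring and aggregation idea described for ODNTD (score categorical features by value frequency, score numerical features separately, combine the two scores, then apply a threshold to get an anomaly candidate set) can be illustrated with a minimal sketch. This is not the authors' implementation. All function names are hypothetical; the numerical scorer is a simple mean absolute z-score standing in for the deep autoencoder's reconstruction error; the aggregation uses a fixed weight rather than the dynamic strategy the abstract mentions; and the threshold is a hand-picked constant.

```python
from collections import Counter
from statistics import mean, pstdev


def categorical_scores(rows):
    """Score each record by how rare its categorical values are.

    A value's rarity is 1 - (its relative frequency in that feature);
    a record's score is the mean rarity over its features.
    """
    n = len(rows)
    n_features = len(rows[0])
    counts = [Counter(row[j] for row in rows) for j in range(n_features)]
    return [
        mean([1.0 - counts[j][row[j]] / n for j in range(n_features)])
        for row in rows
    ]


def numerical_scores(rows):
    """Assumed stand-in for autoencoder reconstruction error:
    the mean absolute z-score of a record over its numerical features."""
    n_features = len(rows[0])
    cols = [[row[j] for row in rows] for j in range(n_features)]
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # constant column: avoid /0
    return [
        mean([abs(row[j] - mus[j]) / sigmas[j] for j in range(n_features)])
        for row in rows
    ]


def min_max(scores):
    """Rescale scores to [0, 1] so the two feature spaces are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def aggregate(cat, num, weight=0.5):
    """Weighted sum of the normalized scores.

    The fixed weight is an assumption made for this sketch only.
    """
    return [
        weight * c + (1.0 - weight) * v
        for c, v in zip(min_max(cat), min_max(num))
    ]


def candidate_set(scores, threshold):
    """Indices of records whose aggregated score reaches the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


# Toy traffic records: (protocol, service) and (bytes, duration).
cat_rows = [("tcp", "http")] * 5 + [("icmp", "other")]
num_rows = [
    (100.0, 1.0), (110.0, 1.2), (90.0, 0.9),
    (105.0, 1.1), (95.0, 1.0), (5000.0, 30.0),
]

scores = aggregate(categorical_scores(cat_rows), numerical_scores(num_rows))
candidates = candidate_set(scores, threshold=0.5)
print(candidates)  # [5]
```

In this toy data only the last record is rare in both feature spaces, so it is the only candidate. In the method as described, the candidate set would then feed the feature screening and selection stage, and the loop would repeat.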