Research On Outlier Detection Technologies Via Information Entropy Theory In Complex Data Environments

Posted on:2024-10-25

Degree:Doctor

Type:Dissertation

Country:China

Candidate:R Li

GTID:1528307358987979

Subject:Cyberspace security

Abstract/Summary:

With the deep integration of the physical world, cyberspace, and human society, anomaly detection in this ternary world has become a key research topic in data mining. Anomaly detection, also called outlier detection, is the mining of unusual, unconventional observations such as errors, risks, or blunders that do not conform to expectations. It is widely used in fields such as cyberspace, financial risk, social networking, industrial production, biomedicine, and national security. However, the massive and complex data environment of the ternary world has characteristics such as diversity, richness, redundancy, dynamics, and multiple sources, and it contains a large amount of uncertain information, including redundancy, noise, ambiguity, incompleteness, imprecision, and inaccuracy. While this environment brings information and knowledge, it also poses challenges for the outlier detection task.

Information entropy theory, as a theoretical paradigm for measuring data uncertainty, has been widely studied and applied in outlier detection in recent years. Existing entropy-based outlier detection research mainly discovers outliers by measuring the degree of chaos, ambiguity, or roughness of objects. Its core idea is to transform the outlier detection problem in a complex data environment into the problem of quantifying the degree of data uncertainty. Using information entropy to measure uncertainty in data and thereby detect anomalies is a core research topic with both theoretical and engineering value. However, research on outlier detection based on information entropy theory is still in its infancy and faces the following problems:

(1) Sifting out redundant and noisy information in massive data is the key to improving data quality and reducing computational cost, yet existing research lacks key-feature mining and supervised outlier detection designs for incomplete mixed data, which results in unnecessary information loss.

(2) Missing data often hides valuable information that plays a key role in outlier detection. Existing studies tend to ignore the impact of the incompleteness of missing data on unsupervised static outlier detection, so outliers that are intentionally hidden or deleted cannot be detected in a timely manner.

(3) Compared with static data, dynamic time series data contains richer uncertainty information. Existing studies have not yet investigated the dynamic uncertainty of time series, which may affect the identification of time series outliers.

(4) The incomplete and heterogeneous mixed information in multi-source data has not been handled effectively. Complementary, reliable, and rich multi-source data fusion and the accurate measurement of anomalies remain difficult problems in unsupervised outlier detection.

In response to these challenges, this dissertation explores the impact of data uncertainty information on outlier detection tasks and its role in knowledge discovery and data mining. By applying information entropy theory to quantify the uncertainty information in data, it analyzes and mines valuable information and knowledge in massive, variable, and complex data environments, providing solutions and technical support for outlier detection research based on information entropy theory. The research work consists of the following four parts:

(1) For massive redundant data, an outlier detection model based on attribute reduction is proposed to address the lack of key-feature mining and supervised outlier detection designs for incomplete mixed data. Starting from the correlation and redundancy between attributes, the model uses conditional entropy to mine the key features that have the highest correlation with the outlier classification task, the smallest redundancy with the known attribute information, and the largest independence. Improving data quality and reducing time cost in this way maintains and optimizes outlier detection capability. Experiments on real-world outlier datasets show that, by processing incomplete mixed data through a neighborhood information network, the model avoids the loss of valuable information and screens out key features, while maintaining or even improving detection performance and achieving a relative balance between time cost and detection performance.

(2) For incomplete missing data, a static outlier detection model based on multi-granularity information is proposed to address the fact that existing research ignores the role of incompleteness in unsupervised static outlier detection. Based on the structural information of the graph, the model starts from the coarse-grained single-attribute information and the fine-grained multi-attribute information of the neighborhood information network and uses neighborhood entropy to mine the valuable uncertainty information embedded in incomplete data. Through the graph structure of a Markov random walk, it strengthens the similarity within similar groups and weakens the correlation between dissimilar groups, realizing in unsupervised scenarios the principle that "birds of a feather flock together". Experiments on a real telecom fraud dataset from Sichuan show that the model can effectively mine the valuable information contained in missing data and achieves effective unsupervised static outlier detection on genuinely incomplete mixed data.

(3) For dynamic time series data, a time series outlier detection model based on multi-dimensional dynamic uncertain information is proposed to address the absence of dynamic time series uncertainty information in unsupervised time series outlier detection. The model starts from three dimensions, "structural information of irregular mutation", "temporal information of long and short cycle variability", and "attribute information of multivariate correlation", and uses temporal entropy to mine point outliers, subsequence outliers, and time series outliers in time series data, realizing unsupervised time series outlier detection. Experiments on real medical brainwave, fraudulent website, and telecommunication fraud datasets show that the proposed model can mine important time series uncertainty information and accomplish outlier detection on dynamic time series data. This addresses a gap in current research on the dynamic uncertainty of time series data and offers a new direction for time series outlier detection.

(4) For heterogeneous multi-source data, a multi-source fusion outlier detection model based on multi-perspective information is proposed to address the lack of heterogeneous multi-source fusion information in unsupervised outlier detection research. From the three perspectives of "neighbor connection relationship", "object similarity relationship", and "affiliation fuzzy relationship", the model first uses a mixed entropy that integrates rough and fuzzy information to fuse incomplete mixed data from multiple sources. It then constructs finer-grained object similarity relationships on the more informative fused data in order to perform unsupervised outlier detection under multi-source data fusion. Experiments on sixteen real datasets show that the proposed model obtains fused data with higher information content, better interpretability, and less redundant noise, and that it characterizes relationships between similar objects at a finer granularity to detect outliers.

To evaluate the models, this dissertation selects publicly available outlier detection data resources (UCI and ODD) and conducts experimental analyses of outlier detection effect, outlier differentiation capability, ablation, and statistical tests, to validate the effectiveness, superiority, applicability, and robustness of the proposed models in different data scenarios.
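
As a rough illustration of how conditional entropy can rank attributes by their relevance to an outlier label, the following Python sketch computes H(label | attribute) on a toy table. It is not the dissertation's model: the function names, the toy attributes (`night_calls`, `region`), and the restriction to complete categorical data are assumptions made for this example only, and the redundancy and independence criteria mentioned in the abstract are not covered.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def conditional_entropy(labels, attribute):
    """H(labels | attribute): label uncertainty left once the attribute is known."""
    n = len(labels)
    groups = {}
    for a, y in zip(attribute, labels):
        groups.setdefault(a, []).append(y)
    # Weight the entropy of each attribute-value group by its relative size.
    return sum(len(g) / n * entropy(g) for g in groups.values())


def rank_attributes(table, labels):
    """Sort attribute names by ascending H(labels | attribute).

    A lower conditional entropy means the attribute explains the label better.
    """
    scores = {name: conditional_entropy(labels, col) for name, col in table.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])


# Hypothetical toy data: six records, two categorical attributes.
labels = ["normal", "normal", "outlier", "normal", "outlier", "normal"]
table = {
    "night_calls": ["no", "no", "yes", "no", "yes", "no"],
    "region": ["a", "b", "a", "b", "b", "a"],
}

ranking = rank_attributes(table, labels)
for name, h in ranking:
    print(f"{name}: H(label | {name}) = {h:.4f} bits")
```

In this toy table `night_calls` determines the label exactly, so its conditional entropy is zero, while `region` leaves the label entropy unchanged and would be the first candidate for removal.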

Keywords/Search Tags:

Data Mining, Outlier Detection, Complex Data Environments, Granular Computing, Information Entropy, Markov Random Walks

Related items

1	Outlier Detection Based On Markov Random Walk And Its Application
2	Research On Methods Of Data Mining Based On Granular Computing
3	Research And Application Of Outlier Detection Algorithm
4	Associations Mining Research Based On Granular Computing
5	Research On Granular Computing Based Outlier Mining Methods
6	Research And Application Of Outlier Detection Algorithm Based On Granular Computing
7	Granular Data Description For System Modeling And Data Mining
8	Study On Data Mining Model Based On Theory Of Granular Computing
9	The Outlier Detection Algorithm Based On Decision Outlier Factor And Markov Model
10	Research And Implementation Of Data Mining Algorithm For Environmental Information