Research On Outlier Detection Algorithm For High Dimensional Data

Posted on:2023-03-27

Degree:Master

Type:Thesis

Country:China

Candidate:Z Qi

Full Text:PDF

GTID:2568306848970929

Subject:Computer technology

Abstract/Summary:

Outlier detection is an active research topic in data mining and has attracted extensive attention in fields such as medicine, finance, and telecommunications. As scientific research advances and tasks grow more complex, both the dimensionality and the size of data keep increasing, which poses great challenges for outlier detection on high-dimensional data. Different types of data also call for different detection methods.

For tabular data, although many techniques have been proposed, most of them face two problems: the neighborhood size of an object is difficult to determine, and distances in high-dimensional space are unreliable. For image data, many reconstruction-based methods have been proposed on the assumption that outliers are more difficult to reconstruct than inliers. However, as data size and dimensionality grow, reconstruction methods do not transfer well to realistic scenarios. The auto-encoder focuses mainly on the quality of sample reconstruction in pixel space and does not check whether the encoded features represent only normal samples. As a result, the learned features may include features shared by outliers and inliers, so the reconstruction error of outliers becomes similar to that of inliers and the model's decisions fail.

To overcome the above problems, this thesis first proposes a novel density-based outlier detection method, which introduces the concept of Minimum the Sum of Edge Set and related definitions in the key attribute space. Based on the stability of the Reverse Minimum the Sum of Edge Set, the method adaptively selects the parameter that represents the neighborhood size. Experiments on synthetic and real datasets show that the proposed method performs better than existing algorithms. In particular, the attribute deletion method can be combined with other algorithms to improve their performance, and the k-parameter search algorithm adaptively obtains a well-performing value of k and maintains a high level of detection performance when dealing with missing values.

For image datasets, this thesis draws on ideas from unsupervised clustering and contrastive learning. It no longer focuses on pixel-level reconstruction loss but instead performs contrastive learning in the feature space. Unlike commonly used contrastive learning methods, it takes the samples in the nearest neighbor set as positive samples, without data augmentation. Unlike clustering algorithms, the training objective is to make the distance between inliers progressively smaller (or their similarity progressively larger), with no constraint imposed between other types of data. After each round of training, the model's predictions are fed back to the model as pseudo labels. Experimental results on four datasets show that the proposed method outperforms existing algorithms.
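
The idea of using nearest neighbors as positive samples, and only pulling them together in feature space, can be illustrated with a minimal NumPy sketch. This is not the thesis's actual loss or network: the function names `nearest_neighbor_positives` and `neighbor_attraction_loss`, the use of cosine similarity, and the "1 minus similarity" objective are all assumptions made here for illustration, and the abstract does not specify any of them.

```python
import numpy as np


def nearest_neighbor_positives(features, k):
    """Indices of the k most cosine-similar other rows for every row.

    Hypothetical helper: assumes every feature vector has a nonzero norm.
    """
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own positive
    return np.argsort(-sim, axis=1)[:, :k]


def neighbor_attraction_loss(features, neighbor_idx):
    """Mean (1 - cosine similarity) between each sample and its positives.

    Assumed objective: it only rewards similarity to the positives and
    places no constraint on any other pair of samples.
    """
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    # normed[neighbor_idx] has shape (n, k, d): the positives of each sample
    pos_sim = np.einsum("nd,nkd->nk", normed, normed[neighbor_idx])
    return float(np.mean(1.0 - pos_sim))


# Toy features: two tight pairs pointing in nearly orthogonal directions.
toy_features = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
toy_neighbors = nearest_neighbor_positives(toy_features, k=1)
toy_loss = neighbor_attraction_loss(toy_features, toy_neighbors)
```

In a real training loop the features would come from an encoder and the loss would be minimized by gradient descent. The pseudo-label step described in the abstract is not shown here.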

Keywords/Search Tags:

Outlier detection, Density-based, Contrastive learning, Pseudo labels, Nearest neighbor set

Related items

1	Research On Algorithms For Outlier Detection
2	Outlier Detection Algorithm And Its Parallelization Based On Weighted K-Nearest Neighbor
3	An Outlier Detection Algorithm Based On Natural Nearest Neighbor
4	Research On Contrastive Learning Method Based On Nearest Neighbor Optimization And Momentum Updat
5	Study On Generalized Nearest Neighbor Pattern Classification
6	Study On Algorithm For Rough Set-based Outlier Detection In High Dimension Space
7	Research And Application Of Outlier Detection Method Based On Nearest Neighbor
8	Density-based Outlier Detection On Uncertain Data
9	Research Of Density Peak Clustering Algorithm Based On K-nearest Neighbor Optimization
10	Research On Technology For Detecting Density-based Outlier