Outlier Mining And Parallelization Based On Gaussian Mixture Model And Relative Subspace

Posted on:2019-05-18

Degree:Master

Type:Thesis

Country:China

Candidate:P P Fan

Full Text:PDF

GTID:2428330566976374

Subject:Computer Science and Technology

Abstract/Summary:

Outlier mining is an important part of data mining. An outlier is a data object that is inconsistent with, and differs significantly from, the other objects in a data set. With the development of data acquisition technology, both the dimensionality and the volume of data sets have increased rapidly. The accuracy of traditional outlier mining algorithms is severely affected by the "curse of dimensionality", so these algorithms cannot adapt to massive data sets. This thesis uses the Gaussian Mixture Model to study outlier mining algorithms based on relative subspaces, together with their parallelization. The main work is as follows:

(1) A relative subspace and outlier mining method based on the Gaussian Mixture Model is proposed. First, the local data set of each data object is computed with K-Nearest Neighbors (KNN). A sparse degree matrix, which reflects how sparse or dense the data set is, is generated from the attribute sparse degrees of each data object. Second, the relative subspace is redefined using the Gaussian Mixture Model and the sparse degree matrix, which allows it to adapt effectively to data sets with various distributions. Third, the outlier score of every data object in its relative subspace is computed from the sparseness of each dimension and the attribute weights. The top-N data objects with the highest outlier scores are identified as outliers. Finally, experimental results on synthetic and UCI data sets validate the correctness and effectiveness of the algorithm.

(2) A parallel outlier mining algorithm based on relative subspaces is proposed on the Spark distributed computing framework. Using Resilient Distributed Datasets (RDDs), the KNN results, the sparse degree matrix, and the relative subspace matrix computed by the various computing nodes are cached in memory. The outlier scores of all data objects are computed across the computing nodes, which improves mining efficiency. Experimental results on stellar spectral data sets validate the scalability and extensibility of the algorithm.
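The pipeline in part (1) can be illustrated with a minimal sketch. The abstract does not give the thesis's formulas, so everything concrete here is an assumption made for illustration: the sparse degree is taken as the mean absolute deviation of an object from its k nearest neighbors on each attribute, the relative subspace of an object is taken as the attributes where a two-component one-dimensional Gaussian mixture assigns it to the sparser component with posterior above 0.5, and the attribute weights are uniform. All function names are hypothetical, and this is not the author's implementation.

```python
import numpy as np


def knn_indices(X, k):
    """Indices of the k nearest neighbours of each row (Euclidean, self excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]


def sparse_degree_matrix(X, k):
    """ASSUMED sparse degree: mean absolute deviation from the k neighbours, per attribute."""
    nn = knn_indices(X, k)
    return np.abs(X[nn] - X[:, None, :]).mean(axis=1)  # shape (n_objects, n_attributes)


def sparse_posterior(v, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture by EM; return P(sparser component | value)."""
    mu = np.array([v.min(), v.max()], dtype=float)
    var = np.full(2, v.var() + 1e-9)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value
        dens = pi * np.exp(-0.5 * (v[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: update mixing weights, means and variances
        nk = resp.sum(axis=0) + 1e-12
        mu = (resp * v[:, None]).sum(axis=0) / nk
        var = (resp * (v[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-9
        pi = nk / nk.sum()
    return resp[:, np.argmax(mu)]  # component with the larger mean = sparser one


def outlier_scores(X, k=5, weights=None):
    """Weighted sum of normalised sparse degrees over each object's relative subspace."""
    S = sparse_degree_matrix(X, k)
    n_attr = S.shape[1]
    # ASSUMED subspace rule: attribute j is relevant for object i if posterior > 0.5
    R = np.column_stack([sparse_posterior(S[:, j]) > 0.5 for j in range(n_attr)])
    # ASSUMED weights: uniform unless the caller supplies them
    w = np.full(n_attr, 1.0 / n_attr) if weights is None else np.asarray(weights, dtype=float)
    S_norm = S / (S.mean(axis=0) + 1e-12)  # make attributes comparable in scale
    return (R * S_norm * w).sum(axis=1)


def top_n(scores, n):
    """Indices of the n objects with the highest outlier scores."""
    return np.argsort(scores)[::-1][:n]


# Synthetic demo: 100 Gaussian points plus one planted outlier (index 100)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = np.vstack([X, [15.0, 0.0, 0.0]])  # deviates in attribute 0 only
scores = outlier_scores(X, k=5)
top = top_n(scores, n=3)
print(top)
```

The planted point deviates in a single attribute, which is the case a subspace-based score is meant to handle: its score comes mainly from the attribute where it is sparse, and the remaining attributes do not dilute it. The brute-force distance matrix in `knn_indices` is quadratic in the number of objects, which is one reason a distributed implementation such as the Spark version in part (2) becomes attractive for large data sets.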

Keywords/Search Tags:

Outlier Mining, Gaussian Mixture Model, Relative Subspace, KNN, Spark

Related items

1	Contextual Outlier Mining And Parallelization Based On Weighted Probability Density
2	Research On Outlier Mining Algorithms Based On Subspace And Its Application
3	Research Of Local Outlier Mining Algorithm Based On Spark
4	Outlier Mining Method Based On Gini Indexes And Sub-space Research
5	Research On Algorithms For Subspace Clustering And Outlier Mining Based-on Information-entropy
6	Study On Outlier Detection In Subspace
7	Optimal Subspace Outlier Mining Algorithm Based On Entropy Increment And Local Attribute Weighting
8	Adaptive Gaussian Mixture Model And Its Application In Speaker Recognition
9	Research On Local Outlier Detection Algorithm Based On Subspace
10	Research On Gaussian Mixture Model Based Location Estimation Algorithms For WSN