| Outlier mining is an important part of data mining. An outlier is a data object that is inconsistent with, and differs significantly from, the other objects in a data set. With the development of data acquisition technology, both the dimensionality and the volume of data sets have increased rapidly. The accuracy of traditional outlier mining algorithms is severely degraded by the "curse of dimensionality", and these algorithms do not scale to massive data sets. This thesis studies outlier mining algorithms based on relevant subspaces, and their parallelization, using the Gaussian Mixture Model (GMM). The main work is as follows: (1) A relevant-subspace outlier mining method based on the Gaussian Mixture Model is proposed. First, the local data set of each data object is determined by its K-Nearest Neighbors (KNN), and a sparse degree matrix, which reflects the sparse and dense regions of the data set, is generated from each data object's attribute sparse degree. Second, the relevant subspace is redefined using the Gaussian Mixture Model and the sparse degree matrix, so that it adapts effectively to data sets with various distributions. Third, the outlier score of each data object in its relevant subspace is calculated from the sparseness of each dimension and the attribute weights; the N data objects with the highest outlier scores are identified as outliers. Finally, experimental results on synthetic and UCI data sets validate the correctness and effectiveness of the algorithm. (2) A parallel outlier mining algorithm based on relevant subspaces is proposed on the Spark distributed computing framework. Using Resilient Distributed Datasets (RDDs), the KNN results, the sparse degree matrix, and the relevant subspace matrix computed by the computing nodes are cached in memory. The outlier scores of all data objects are then calculated on the individual computing nodes, which improves mining efficiency. Experimental results on stellar spectral data sets validate the scalability and extensibility of the algorithm. |
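The three steps of contribution (1) can be illustrated with a small single-machine sketch. It is not the thesis's implementation. The abstract does not give the formulas, so the following are assumptions made here: the attribute sparse degree is taken as the mean absolute deviation of an object's k nearest neighbors in that attribute; a two-component one-dimensional GMM is fitted per attribute to the sparse degrees; an attribute enters an object's relevant subspace when the posterior of the higher-mean component exceeds 0.5; attribute weights default to 1. All function names are hypothetical.

```python
import math


def knn_indices(data, k):
    """For each object, return the indices of its k nearest neighbors (brute force)."""
    result = []
    for i, p in enumerate(data):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(data) if j != i)
        result.append([j for _, j in dists[:k]])
    return result


def sparsity_matrix(data, neighbors):
    """Assumed sparse degree: mean absolute deviation of the neighbors, per attribute."""
    dims = len(data[0])
    return [
        [sum(abs(data[j][d] - p[d]) for j in nbrs) / len(nbrs) for d in range(dims)]
        for p, nbrs in zip(data, neighbors)
    ]


def gaussian(v, mu, sigma2):
    """Density of a 1-D normal distribution with mean mu and variance sigma2."""
    return math.exp(-(v - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


def high_component_posterior(values, iters=50, var_floor=1e-6):
    """Fit a two-component 1-D GMM by EM; return P(higher-mean component | value)."""
    n = len(values)
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, var_floor)
    mu, sigma2, weight = [min(values), max(values)], [var, var], [0.5, 0.5]
    resp = [0.5] * n  # responsibility of component 1 for each value
    for _ in range(iters):
        # E-step: posterior of component 1, guarded against numerical underflow.
        for i, v in enumerate(values):
            dens = [weight[c] * gaussian(v, mu[c], sigma2[c]) for c in (0, 1)]
            total = dens[0] + dens[1]
            resp[i] = dens[1] / total if total > 0 else 0.5
        # M-step: re-estimate mean, variance, and mixing weight of each component.
        for c in (0, 1):
            r = [x if c == 1 else 1.0 - x for x in resp]
            nc = max(sum(r), 1e-12)
            mu[c] = sum(ri * v for ri, v in zip(r, values)) / nc
            spread = sum(ri * (v - mu[c]) ** 2 for ri, v in zip(r, values)) / nc
            sigma2[c] = max(spread, var_floor)
            weight[c] = nc / n
    return resp if mu[1] >= mu[0] else [1.0 - x for x in resp]


def outlier_scores(data, k, weights=None):
    """Sum the weighted sparse degrees over each object's relevant attributes."""
    dims = len(data[0])
    weights = weights or [1.0] * dims
    sparsity = sparsity_matrix(data, knn_indices(data, k))
    scores = [0.0] * len(data)
    for d in range(dims):
        column = [row[d] for row in sparsity]
        posterior = high_component_posterior(column)
        for i, p in enumerate(posterior):
            if p > 0.5:  # attribute d is in the relevant subspace of object i
                scores[i] += weights[d] * column[i]
    return scores


def top_n_outliers(scores, n):
    """Return the indices of the n objects with the highest outlier scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]


# Toy data: a dense 5 x 5 grid plus one object that deviates only in attribute 1.
grid = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
data = grid + [(0.2, 5.0)]
scores = outlier_scores(data, k=3)
print(top_n_outliers(scores, 1))
```

In this toy data the last object is sparse only in its second attribute, so only that attribute contributes to its score. In the parallel variant described in contribution (2), the neighbor lists, the sparse degree matrix, and the relevant subspace matrix would presumably be the intermediate results cached as RDDs, but the abstract does not specify how the work is partitioned.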