| Outlier mining is an important research topic in data mining; outliers are data objects that are inconsistent with, and differ significantly from, the other data in a given dataset. With the explosive growth of data volume and dimensionality, the shortcomings of traditional outlier mining algorithms have become increasingly apparent, and these algorithms adapt poorly to massive, high-dimensional data. Traditional outlier mining methods also focus on efficiency and precision, while the interpretability and comprehensibility of the mining results are rarely addressed. This thesis studies a parallel contextual outlier mining algorithm based on relevant subspaces. The main contributions are as follows: (1) A contextual outlier mining algorithm based on the MapReduce programming model is proposed. First, the relevant subspace of each data object is determined from its local sparse difference degree, and the outlier factor of the object is calculated in that relevant subspace; the outlier factor and the set of correlated attribute dimensions in the relevant subspace are defined as contextual information. Second, the N data objects with the largest outlier factors are selected as contextual outliers. Third, the outlier mining algorithm is parallelized using the MapReduce programming model. Finally, experimental results on the UCI dataset verify that contextual information improves the interpretability and comprehensibility of the outliers. (2) A contextual outlier mining algorithm based on relevant subspaces is proposed for the in-memory computing platform Spark. The k-nearest neighbors (KNN), the local sparse degree matrix, and the local sparse difference degree are cached in memory as RDDs, which improves the efficiency of outlier mining and reduces I/O cost. Experimental results on the stellar spectral dataset verify the scalability and extensibility of the algorithm. |
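
The serial core of contribution (1) can be illustrated with a minimal sketch. The abstract does not define the local sparse difference degree or the outlier factor, so the scoring below is a stand-in assumption: a dimension counts as relevant when the object deviates from its k nearest neighbours' mean by more than `threshold` times the neighbours' own spread, and the outlier factor is the sum of those deviations. The names `k_nearest`, `contextual_outliers`, and `threshold` are hypothetical. The sketch only shows the overall shape: a relevant subspace per object, an outlier factor computed in it, and top-N selection that returns the contextual information alongside each outlier.

```python
import heapq
import math


def k_nearest(data, i, k):
    """Indices of the k nearest neighbours of data[i] (full-space Euclidean)."""
    dists = [(math.dist(data[i], p), j) for j, p in enumerate(data) if j != i]
    return [j for _, j in heapq.nsmallest(k, dists)]


def contextual_outliers(data, k=3, n=1, threshold=2.0):
    """Return the n objects with the largest outlier factor, with their context.

    Placeholder scoring (an assumption, not the thesis's definition): a
    dimension is relevant for an object when its deviation from the
    neighbours' mean exceeds `threshold` times the neighbours' spread.
    """
    results = []
    for i, x in enumerate(data):
        nbrs = k_nearest(data, i, k)
        relevant, factor = [], 0.0
        for d in range(len(x)):
            vals = [data[j][d] for j in nbrs]
            mean = sum(vals) / len(vals)
            # Small constant avoids division by zero for identical neighbours.
            spread = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals)) + 1e-9
            score = abs(x[d] - mean) / spread
            if score > threshold:
                relevant.append(d)
                factor += score
        # Contextual information: the outlier factor plus the relevant dimensions.
        results.append({"index": i, "factor": factor, "dims": relevant})
    return heapq.nlargest(n, results, key=lambda r: r["factor"])


if __name__ == "__main__":
    points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (1.2, 1.0), (1.0, 9.0)]
    for outlier in contextual_outliers(points, k=3, n=1):
        print(outlier["index"], outlier["dims"], round(outlier["factor"], 2))
```

Returning the relevant dimensions with each score is what makes the result interpretable: in the toy data the last point is flagged only because of its second attribute. In a MapReduce or Spark setting, the per-object loop would be the part to distribute.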