
Research On DBSCAN Algorithm Under Non-Independent And Identical Distribution

Posted on: 2021-01-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y K Lv
Full Text: PDF
GTID: 2428330602997178
Subject: Computer application technology
Abstract/Summary:
Clustering is an important part of data mining and a challenging research field. Its purpose is to group similar data objects together and to separate dissimilar data objects as far as possible. DBSCAN is a density-based clustering algorithm with strong performance. It partitions regions of sufficient density into clusters, where a cluster is the maximal set of density-connected data objects. Its main advantages are fast clustering, effective handling of noise points, and the ability to find spatial clusters of arbitrary shape. However, the traditional DBSCAN algorithm assumes that data objects and attributes are independent and identically distributed (IID). As a result, the traditional distance formula has difficulty measuring the similarity between categorical data objects and attributes accurately, and the algorithm is sensitive to its parameters, which are difficult to determine.

To address these problems, this thesis studies the DBSCAN algorithm under non-independent and identical distribution (Non-IID). The unsupervised clustering problem for categorical data is addressed using the Non-IID idea: the similarity between data objects and attributes is calculated with the Non-IID coupling similarity formula, and the coupling similarity is output as a matrix. A neighborhood interval (lower limit Eps1 and upper limit Eps2) and a threshold are used to partition the high-density data sets, so that higher-quality clustering results can be obtained more quickly.

For common categorical data, a Non-IID DBSCAN algorithm (DBSCAN under Non-Independent and Identical Distribution) is proposed. The array Rm is obtained by sorting, in ascending order, the coupling similarities between Om, the data object with the largest coupling similarity, and the other data objects. Rm is plotted to select the neighborhood interval lower limit Eps1: the curve first rises slowly, then levels off, and finally steepens suddenly at some point. The coupling similarity value at which the curve suddenly steepens is set as Eps1. The k-nearest-neighbor distance values to the right of Eps1 are then found (k takes the value of Minpts), and the maximum of these values is set as the neighborhood interval upper limit Eps2. A density formula is used to determine how sparse or dense the data are, and the threshold Minpts is set according to the density value: generally, Minpts is 4 if the density value is large and 2 if it is small.

For categorical data containing Boolean data, an NIB-DBSCAN algorithm (DBSCAN under Non-Independent and Identical Distribution for Boolean data) is proposed. Scatter plots are used to fit the distribution of coupling similarity between data objects and attributes, the boundary points between data clusters are found, and the corresponding coupling similarity value is set as Eps1. For data sets with a small number of clusters, a weighted average is used to select the threshold Minpts; for data sets with a large number of clusters, Minpts is selected using the special value method.

Finally, experimental results on UCI data sets show that DBSCAN under non-independent and identical distribution obtains higher-precision clustering results and improves the applicability of the algorithm.
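
The neighborhood-interval idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation and rests on several assumptions. A simple attribute-overlap similarity stands in for the coupling similarity, whose formula the abstract does not give. Om is taken to be the object with the largest total similarity to the others. The visual inspection of the Rm curve is replaced by a "largest jump between consecutive values" heuristic. Eps2 is simply fixed at 1.0 rather than derived from k-nearest-neighbor values. An object q is treated as a neighbor of p when Eps1 <= similarity(p, q) <= Eps2. All function names are hypothetical.

```python
from collections import deque


def overlap_similarity_matrix(rows):
    """Stand-in similarity: fraction of attributes on which two objects agree.
    ASSUMPTION: this is NOT the coupling similarity used in the thesis."""
    m = len(rows[0])
    return [[sum(a == b for a, b in zip(r, s)) / m for s in rows] for r in rows]


def pick_eps1_by_largest_jump(sim):
    """Heuristic replacement for reading Eps1 off the plotted Rm curve."""
    n = len(sim)
    # ASSUMPTION: Om is the object with the largest total similarity to the rest.
    om = max(range(n), key=lambda i: sum(sim[i]) - sim[i][i])
    # Rm: similarities between Om and every other object, in ascending order.
    rm = sorted(sim[om][j] for j in range(n) if j != om)
    # Take the value just after the largest jump as the "sudden steepening".
    k = max(range(1, len(rm)), key=lambda i: rm[i] - rm[i - 1])
    return rm[k]


def interval_dbscan(sim, eps1, eps2, min_pts):
    """DBSCAN over a precomputed similarity matrix; returns labels, -1 = noise.
    ASSUMPTION: q is a neighbor of p when eps1 <= sim[p][q] <= eps2."""
    n = len(sim)
    neighbors = [[q for q in range(n) if eps1 <= sim[p][q] <= eps2]
                 for p in range(n)]
    labels = [None] * n
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        if len(neighbors[p]) < min_pts:
            labels[p] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[p] = cluster
        queue = deque(neighbors[p])
        while queue:
            q = queue.popleft()
            if labels[q] == -1:
                labels[q] = cluster  # noise reached from a core point: border
                continue
            if labels[q] is not None:
                continue  # already assigned to a cluster
            labels[q] = cluster
            if len(neighbors[q]) >= min_pts:
                queue.extend(neighbors[q])  # q is a core point: keep expanding
    return labels


# Toy categorical data set (invented for illustration only).
rows = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("red", "small", "round"),
    ("blue", "large", "flat"),
    ("blue", "large", "cube"),
    ("blue", "large", "flat"),
    ("green", "medium", "oval"),
]
sim = overlap_similarity_matrix(rows)
eps1 = pick_eps1_by_largest_jump(sim)
labels = interval_dbscan(sim, eps1, eps2=1.0, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

On this toy data the heuristic picks Eps1 = 2/3, the two groups of three objects form separate clusters, and the last object is labeled noise. With real coupling similarities, the Eps1 and Eps2 selection rules described in the abstract would replace the two shortcuts used here.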
Keywords/Search Tags: Non-IID, Coupling relationship, DBSCAN algorithm, Coupling similarity degree matrix, Principle of statistics