Research On The Initialization Methods Of Clustering Centers Based On Outlier Detection

Posted on: 2024-05-15
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Yang
GTID: 2568307142951789
Subject: Computer Science and Technology
Abstract/Summary:
The rapid development of information technology gives people access to ever more data, and making full use of this data has become a serious challenge across industries. Data mining can extract significant knowledge from massive, heterogeneous data and has therefore received widespread attention. Cluster analysis is an unsupervised learning method in data mining that partitions a given sample set into several distinct clusters; analyzing the clustering results reveals associations within the sample set. Because it can handle various types of unlabeled data, clustering has become one of the main branches of data mining in recent years. Among the many clustering algorithms used for data mining problems, partition-based clustering is currently the most commonly used. Typical partition-based clustering algorithms include K-means, K-modes, and K-prototype.

For partition-based clustering algorithms, the initialization of cluster centers (i.e., the selection of initial cluster centers) is of great significance. Poorly chosen initial cluster centers easily lead to a poor cluster structure, which degrades clustering performance. Existing partition-based clustering algorithms share two common problems in cluster center initialization: (1) they cannot simultaneously avoid selecting outliers as initial cluster centers and selecting multiple initial cluster centers from the same cluster; (2) they do not consider that different attributes may carry different weights when calculating the density of samples or the distance between samples. Outliers are the small portion of a data set that differs significantly from the rest of the data. Selecting outliers as initial cluster centers not only slows the convergence of a clustering algorithm but can also reduce the accuracy of its results.

To address these problems, this thesis investigates cluster center initialization from the perspective of outlier detection. Specifically, we propose three cluster center initialization algorithms, for K-means, fuzzy K-modes, and K-prototype respectively, to handle clustering problems involving numerical, categorical, and mixed data. The proposed algorithms not only consider the outlier status of each sample but also assign a weight to each attribute when calculating the density of samples and the distance between samples, so as to reflect the differences among attributes. The main research work of this thesis is as follows:

(1) K-means cluster center initialization algorithm based on outlier detection. To tackle the initialization problem of the K-means algorithm on numerical data, we devise an outlier detection-based K-means cluster center initialization algorithm called IKM_OD (Initialization of K-means Based on Outlier Detection). First, IKM_OD calculates the weight of each attribute using the granular combination entropy. Second, it calculates the distance outlier factor of each sample using distance-based outlier detection. Third, it calculates the weighted density of each sample and the weighted Euclidean distance between the sample and the existing initial cluster centers. Finally, it integrates the distance outlier factor, weighted density, and weighted Euclidean distance to compute the possibility of each sample becoming an initial cluster center. IKM_OD not only avoids selecting outliers as initial cluster centers but also improves the quality of the initial cluster centers by using weighted density and weighted Euclidean distance.

(2) Fuzzy K-modes cluster center initialization algorithm based on outlier detection. To tackle the initialization problem of the fuzzy K-modes algorithm on categorical data, we devise an outlier detection-based fuzzy K-modes cluster center initialization algorithm called IFKM_OD (Initialization of Fuzzy K-modes Based on Outlier Detection). First, IFKM_OD calculates the weight of each attribute using partition entropy. Second, it uses an improved distance-based method for categorical data to calculate the distance outlier factor of each sample. Third, it calculates the weighted average density of each sample and the weighted matching distance between the sample and the existing initial cluster centers. Finally, it combines the distance outlier factor, weighted average density, and weighted matching distance to calculate the possibility of each sample becoming an initial cluster center. IFKM_OD not only avoids selecting outliers as initial cluster centers but also addresses the problem of multiple initial cluster centers being selected from the same cluster. It therefore provides an effective initialization mechanism for clustering categorical data.

(3) K-prototype cluster center initialization algorithm based on outlier detection. To tackle the initialization problem of the K-prototype algorithm on mixed data, we devise an outlier detection-based K-prototype cluster center initialization algorithm called IKP_OD (Initialization of K-prototype Based on Outlier Detection). First, IKP_OD calculates the weight of each attribute using granular neighborhood entropy. Second, it calculates the outlier factors of each sample separately on the numerical and the categorical attributes. Third, it calculates the weighted density of each sample on the numerical and the categorical attributes, and likewise the weighted distance between the sample and the existing initial cluster centers on each of the two attribute types. Fourth, it combines these six factors to calculate the possibility of each sample becoming an initial cluster center. IKP_OD not only avoids selecting outliers as initial cluster centers and ensures that the selected initial cluster centers are representative, but also prevents several initial cluster centers from being drawn from the same cluster. It therefore provides an effective initialization mechanism for clustering mixed data.
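
The numerical-data case can be illustrated with a minimal Python sketch. It is not the IKM_OD algorithm itself: the abstract does not give the formulas, so every quantity below is an assumed stand-in. Uniform attribute weights replace the granular combination entropy, the outlier factor is taken as the normalized mean distance to the nearest neighbors, the density is the share of other samples within the mean pairwise distance, and the three factors are combined by a simple product. The names `weighted_distance`, `normalize`, `init_centers`, and `n_neighbors` are hypothetical.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two numeric samples."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def normalize(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def init_centers(samples, k, weights=None, n_neighbors=3):
    """Pick k initial center indices (illustrative; needs 2 <= k <= len(samples))."""
    n, dim = len(samples), len(samples[0])
    if weights is None:
        # ASSUMPTION: uniform weights stand in for entropy-based attribute weights.
        weights = [1.0 / dim] * dim
    dist = [[weighted_distance(a, b, weights) for b in samples] for a in samples]

    # ASSUMPTION: outlier factor = normalized mean distance to the nearest neighbors.
    knn_mean = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:n_neighbors]
        knn_mean.append(sum(nearest) / len(nearest))
    outlier = normalize(knn_mean)  # 1.0 marks the most outlying sample

    # ASSUMPTION: density = share of other samples within the mean pairwise distance.
    pair_count = n * (n - 1) / 2
    radius = sum(dist[i][j] for i in range(n) for j in range(i + 1, n)) / pair_count
    density = [
        sum(1 for j in range(n) if j != i and dist[i][j] <= radius) / (n - 1)
        for i in range(n)
    ]

    chosen = []
    while len(chosen) < k:
        def score(i):
            # Favor dense, non-outlying samples far from the centers chosen so far.
            base = (1.0 - outlier[i]) * density[i]
            if not chosen:
                return base
            return base * min(dist[i][c] for c in chosen)

        candidates = [i for i in range(n) if i not in chosen]
        chosen.append(max(candidates, key=score))
    return chosen


# Two tight clusters plus one distant outlier at index 8.
samples = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11),
           (50, 50)]
print(init_centers(samples, 2))
```

In this toy data the point (50, 50) has the largest mean neighbor distance, so its normalized outlier factor is 1 and its score is 0; it is never selected. The distance term then pushes the second center into the cluster that does not contain the first.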
Keywords/Search Tags: initialization of cluster centers, outlier detection, outlier factor, rough sets, weighted distance, weighted density