Research And Implementation On Clustering Algorithms In Uncertain Data Streams Environment

Posted on: 2012-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: K Wang
Full Text: PDF
GTID: 2298330467477878
Subject: Computer software and theory
Abstract/Summary:
Data mining technology has attracted wide attention because of its ability to extract useful patterns from vast amounts of information. As an important data mining method, clustering is widely used in many applications. It divides data objects into groups according to descriptive information supplied by users, revealing the relationship between the data distribution and its attributes.

In recent years, the understanding of data has deepened along with advances in data collection techniques. As a result, uncertainty in data has drawn increasing attention. Traditional clustering techniques cannot be applied directly to uncertain data, so clustering algorithms for uncertain data need to be studied. Moreover, in most applications the data arrives as streams rather than being stored in databases. Because data in a stream changes over time, arrives at a variable rate, and is very large in volume, clustering uncertain data streams imposes stricter requirements.

For example, in a data stream environment the data arrives quickly, which requires the clustering algorithm to process each data point fast. The cost of clustering is especially high when the objects being clustered are uncertain. In this thesis, two clustering algorithms for uncertain data streams are proposed with the aim of reducing execution time.

First, the MBR (Minimum Bounding Rectangle) is used to describe the distribution of the instances of an uncertain data point. It is then proved that the expected distance between an uncertain data point and a cluster center can be replaced by the distance between the MBR’s geometric center and the cluster center, with an error of at most half of the MBR’s diagonal. Based on this result, a new algorithm is proposed to cluster uncertain data streams.
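
The bound stated above can be checked numerically. The following is a minimal sketch, not the thesis's implementation: it assumes Euclidean distance, an uncertain point given as a finite list of instances with probabilities, and hypothetical helper names (`mbr`, `mbr_center`, `half_diagonal`, `expected_distance`) chosen only for illustration.

```python
import math

def mbr(instances):
    """Return the (lower, upper) corners of the axis-aligned MBR of the instances."""
    lower = tuple(min(axis) for axis in zip(*instances))
    upper = tuple(max(axis) for axis in zip(*instances))
    return lower, upper

def mbr_center(lower, upper):
    """Geometric center of the MBR."""
    return tuple((lo + hi) / 2 for lo, hi in zip(lower, upper))

def half_diagonal(lower, upper):
    """Half the length of the MBR's diagonal."""
    return math.dist(lower, upper) / 2

def expected_distance(instances, probs, center):
    """Probability-weighted mean distance from the instances to a cluster center."""
    return sum(p * math.dist(x, center) for x, p in zip(instances, probs))

# Hypothetical uncertain point: four equally likely instances (illustrative data).
instances = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
probs = [0.25, 0.25, 0.25, 0.25]
cluster_center = (5.0, 1.0)

lower, upper = mbr(instances)
approx = math.dist(mbr_center(lower, upper), cluster_center)  # cheap approximation
exact = expected_distance(instances, probs, cluster_center)   # costly exact value
error_bound = half_diagonal(lower, upper)

# The approximation error stays within half of the MBR's diagonal.
assert abs(exact - approx) <= error_bound
```

The approximation needs one distance computation per cluster instead of one per instance, which is where the saving in execution time would come from under these assumptions.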
The idea of this algorithm is to exclude the more distant clusters using the upper and lower bounds of the expected distance, thereby reducing computational cost.

To further improve the performance of the clustering algorithm, the concept of a cluster’s MBR is introduced, based on the distribution of data points in the cluster. A new strategy for clustering uncertain data is proposed based on the spatial relationship between the data point’s MBR and a cluster’s MBR. There are three types of relationship between the two MBRs: contained, intersecting, and disjoint. Many clusters can be excluded by a simple check of the relationship between the MBR of an uncertain data point and the MBR of a cluster.

Extensive experiments are carried out at the end of this thesis. The results show that both clustering algorithms effectively reduce computational cost and thus shorten the execution time of clustering.
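
The two pruning ideas in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the thesis's algorithms: distances are Euclidean, MBRs are axis-aligned `(lower, upper)` corner pairs, "contained" is taken to mean that the data point's MBR lies inside the cluster's MBR, and the function names `prune_by_bounds` and `mbr_relation` are hypothetical. The abstract does not say which relationship excludes a cluster, so the sketch only classifies the relationship.

```python
import math

def prune_by_bounds(point_center, half_diag, cluster_centers):
    """Return indices of clusters that may still be the nearest one.

    For each cluster, the expected distance lies in [d - h, d + h], where d is
    the distance from the point's MBR center to the cluster center and h is
    half of the MBR's diagonal. A cluster whose lower bound exceeds the smallest
    upper bound cannot be the nearest cluster and is excluded.
    """
    dists = [math.dist(point_center, c) for c in cluster_centers]
    best_upper = min(d + half_diag for d in dists)
    return [i for i, d in enumerate(dists)
            if max(0.0, d - half_diag) <= best_upper]

def mbr_relation(point_mbr, cluster_mbr):
    """Classify the point's MBR against a cluster's MBR.

    Returns "contained", "intersecting", or "disjoint".
    """
    (p_lo, p_hi), (c_lo, c_hi) = point_mbr, cluster_mbr
    axes = list(zip(p_lo, p_hi, c_lo, c_hi))
    if all(cl <= pl and ph <= ch for pl, ph, cl, ch in axes):
        return "contained"
    if any(ph < cl or ch < pl for pl, ph, cl, ch in axes):
        return "disjoint"
    return "intersecting"

# Hypothetical data: one uncertain point and three clusters.
candidates = prune_by_bounds((1.0, 1.0), math.sqrt(2),
                             [(5.0, 1.0), (2.0, 1.0), (20.0, 20.0)])
relation = mbr_relation(((0.0, 0.0), (2.0, 2.0)), ((1.0, 1.0), (4.0, 4.0)))
```

Both checks use only a handful of comparisons per cluster, so the exact expected distance needs to be computed only for the clusters that survive.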
Keywords/Search Tags: data mining, uncertainty, clustering, data streams, expected distance, MBR