Research On Clustering Algorithm Based On Subspace In High-dimensional Data Streams

Posted on:2011-10-04

Degree:Master

Type:Thesis

Country:China

Candidate:W W Zhou

Full Text:PDF

GTID:2198330338490994

Subject:Computer application technology

Abstract/Summary:


Data stream clustering is an important research area in data stream mining. Existing algorithms, both in China and internationally, still leave many problems unsolved: the inherent sparsity of high-dimensional data is not handled well, clustering is inefficient, the supported data types are limited to numerical data, and users' needs cannot be met. To address these problems, this thesis focuses on subspace-based clustering of data streams. The research is relevant to e-commerce, network communication, business intelligence, and similar applications.

First, clustering efficiency and accuracy are strongly affected by the high volatility of the data stream's flow rate, by the resource-constrained clustering environment, and by the sparseness of high-dimensional data streams. To address this, we propose a new adaptive subspace-based algorithm for high-dimensional data streams, called SAStream. We improve the cluster structure used in HPStream and define candidate clusters. The distance from each newly arriving data point is computed only to the centroids of the candidate clusters rather than to those of all clusters, which reduces the number of clusters examined during clustering. The created clusters are stored in a pyramidal time frame, and a time fading function is used to discount the history of past behavior. When the data rate is high, the LimitingRadius and the cluster selection factor adjust automatically, and the clustering granularity is adjusted continuously.

Second, to cluster high-dimensional categorical data streams, we propose a new algorithm called SUBCStream. The thesis redefines the compressed storage structure of the clusters, using a symbol matrix and a frequency matrix to store the data. Clusters and their maximal relevant subspaces are found by minimizing an objective function. The additivity property of the cluster structure is used to merge cluster structures and to add new data points. To discount the history of past behavior and reduce the maintenance cost, a fading function is added to every cluster.

Finally, the SAStream and SUBCStream algorithms are implemented in Java. All experiments are performed on real and synthetic datasets. The experimental results show the feasibility and effectiveness of our algorithms.
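
Three ideas in the abstract can be illustrated with a small Python sketch: a fading function that discounts older data, the additivity of a cluster summary, and distance computation restricted to candidate clusters within a subspace. This is not the thesis's implementation. The exponential fading form `2 ** (-decay * dt)`, the class `FadingCluster`, the helper names, and the numerical per-dimension summary are all assumptions chosen for illustration. The categorical symbol and frequency matrices of SUBCStream are not modeled.

```python
import math
from dataclasses import dataclass


@dataclass
class FadingCluster:
    """Hypothetical faded cluster summary (numerical data only)."""
    ls: list            # per-dimension faded sum of values
    ss: list            # per-dimension faded sum of squared values
    weight: float       # faded number of points
    last_update: float  # time of the last update
    decay: float = 0.5  # assumed fading rate

    def _factor(self, now):
        # Assumed fading function: weight halves every 1/decay time units.
        return 2.0 ** (-self.decay * (now - self.last_update))

    def fade_to(self, now):
        """Discount the whole summary to the time `now`."""
        f = self._factor(now)
        self.ls = [v * f for v in self.ls]
        self.ss = [v * f for v in self.ss]
        self.weight *= f
        self.last_update = now

    def add_point(self, point, now):
        """Additivity: a new point is absorbed by component-wise addition."""
        self.fade_to(now)
        self.ls = [a + x for a, x in zip(self.ls, point)]
        self.ss = [a + x * x for a, x in zip(self.ss, point)]
        self.weight += 1.0

    def merge(self, other, now):
        """Additivity: two summaries faded to the same time simply add up."""
        self.fade_to(now)
        g = other._factor(now)  # fade the other summary without mutating it
        self.ls = [a + b * g for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b * g for a, b in zip(self.ss, other.ss)]
        self.weight += other.weight * g

    def centroid(self):
        return [v / self.weight for v in self.ls]


def new_cluster(point, now, decay=0.5):
    """Create a summary that holds a single point."""
    values = [float(x) for x in point]
    return FadingCluster(values, [x * x for x in values], 1.0, now, decay)


def nearest_candidate(point, candidates, dims):
    """Nearest candidate cluster, measuring distance only in dimensions `dims`."""
    def subspace_distance(cluster):
        c = cluster.centroid()
        return math.sqrt(sum((point[d] - c[d]) ** 2 for d in dims) / len(dims))
    return min(candidates, key=subspace_distance)
```

Because every component of the summary is a sum, fading, insertion, and merging are each a single pass over the dimensions, and no raw data points need to be kept. Only the clusters passed in `candidates` are examined for an arriving point, so the cost depends on the size of that candidate set and not on the total number of clusters.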

Keywords/Search Tags:

High-dimensional data streams, Data stream clustering, Subspace, Data rate, Adaptive, Fading factor


Related items

1	Study On Key Technologies Of Frequent Items Mining And Clustering On Data Streams
2	Research On Clustering Algorithm Of Data Stream
3	The Analysis And Application Of Clustering Algorithm For Multi-Dimensional Data Streams
4	Research On Clustering Algorithm Over High Dimensional Data Stream Based On Irregular Grid Data
5	A High Dimensional Data Stream Clustering Algorithm Of Quick Dimension Reduction
6	Research On The Algorithm For Mining Frequent Items From Data Streams
7	Research On Density-based Subspace Clustering Algorithm For Data Streams
8	Research On Density-Based Subspace Clustering Algorithm For Data Streams
9	A Density-Based Clustering Algorithm Over Stream Data
10	Research On Subspace Clustering Algorithms For High-dimensional Data