The Research On Some Key Issues In High Dimensional Data Clustering

Posted on: 2012-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: M X Xie
GTID: 2210330371962653
Subject: Cartography and Geographic Information Engineering
Abstract/Summary:
The development of clustering methodology has been a truly interdisciplinary endeavor, drawing on taxonomy, social science, psychology, biology, statistics, mathematics, computer science and other fields. The aim of clustering is to find structure in data, so it is exploratory in nature. Clustering has a long and rich history in a variety of scientific fields. Despite the large number of clustering algorithms and their success in many application domains, clustering remains a difficult problem. This can be attributed to the inherent vagueness in the definition of a cluster and to the difficulty of defining an appropriate similarity measure, which is also the main point of difference among clustering algorithms. Because traditional clustering algorithms are designed for low dimensional data, they face severe challenges when the dimensionality is very high, and their results become unreliable. This dissertation focuses on high dimensional clustering. It points out some emerging and useful directions for adapting traditional clustering algorithms to high dimensional space, studies similarity measurement and feature transformation, and proposes a reasonable approach to high dimensional clustering. The validity and feasibility of the proposed algorithm are demonstrated through clustering analysis of several data sets from the UCI machine learning repository. The main contents and innovations of the dissertation are as follows.
1. When a distance measure designed for low dimensional space is used in high dimensional space, the distances between objects lose their contrast as the dimensionality increases. In traditional distance or similarity measures, the dimensions on which attribute values differ greatly play a dominant role.
However, such dimensions are always present in high dimensional space, so the true similarity relations between high dimensional objects are submerged. The study of efficient distance or similarity (dissimilarity) measures for high dimensional space is therefore both important and challenging. Existing high dimensional clustering methods commonly use Euclidean distance to measure the distance between objects. Because of the curse of dimensionality, the traditional Lk-norm is unsuitable for high dimensional space. To measure distance or similarity between objects reasonably in high dimensional space, and to overcome the failure of the Lk-norm, the distance or similarity function can be redesigned so that it is both meaningful in high dimensional space and convenient to compute.
2. Whether a similarity measure describes the similarity relations between objects reasonably depends on the quality of the data partition, which in turn is judged by whether the partition accords with the data distribution. A correct partition is therefore the precondition for obtaining the true similarity between objects. The partition-based similarity measure is improved through unequal partition, and an improved function is proposed to measure the similarity between objects in high dimensional space. First, the data are partitioned into equal intervals based on the histogram of the data distribution. Then, neighboring intervals whose values are below a threshold are merged. Finally, an unequal partition that accords with the data distribution is obtained. The proposed function, based on unequal partition, integrates the similarity measures for all kinds of data. It not only takes full advantage of the traditional functions for numerical, binary and categorical data, but also takes relative distance into account.
Moreover, an unequal partition that accords with the data distribution avoids the problems that arise when the data are evenly or poorly distributed.
3. The dimensionality reduction process is formulated as an optimization problem: a fitness function is designed and the problem is solved with a genetic algorithm. During optimization, the Euclidean distance between 2D objects approximates the shortest-path distance between the corresponding high dimensional objects. In other words, the similarity relations between high dimensional objects are preserved in 2D space. This dissertation designs a dimensionality reduction method for high dimensional data based on a genetic algorithm and an RBF neural network, and obtains a dimensionality reduction converter. The converter can be used to obtain the low dimensional coordinates of a new high dimensional object quickly and effectively. In practice the quantity of data may be very large, so to improve the efficiency of dimensionality reduction, some objects are chosen at random as samples and the neural network is trained on the coordinate pairs of these samples to obtain the converter.
4. A high dimensional clustering algorithm based on the improved similarity measure and feature transformation is proposed in this dissertation. First, the similarity matrix of the high dimensional data is computed with the similarity function designed in the dissertation and converted into a distance matrix. A graph is constructed from the distance matrix by nearest neighbor search, and the shortest-path distance matrix is obtained with the Floyd algorithm. Then, the dimensionality reduction process is formulated as an optimization problem with a designed fitness function and solved with a genetic algorithm.
Finally, the reduced data are clustered with k-means, and the pairs of high dimensional coordinates and their reduced 2D coordinates are used to train an RBF neural network, which is then saved. The cluster membership of a new object is determined by mapping it through the trained neural network and measuring its distance to each current cluster center. At the start of clustering, the dimensionality-reduced data can be visualized to guide the choice of initial cluster centers and of the number of clusters, which improves clustering precision and efficiency. Once a distance matrix or similarity matrix has been obtained, existing clustering algorithms can conveniently be applied to the high dimensional data through dimensionality reduction.
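The loss of contrast between distances described in point 1 can be illustrated with a small experiment. This is a minimal sketch and is not taken from the dissertation: it assumes points drawn uniformly at random from the unit cube, and the function name relative_contrast and its parameters are hypothetical. It reports how far the farthest point is from a query relative to the nearest one; under these assumptions the ratio should shrink as the dimension grows.

```python
import math
import random


def relative_contrast(n_points, dim, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points random points in [0, 1]^dim.

    Assumption: uniform random data, chosen only for illustration.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# Print the contrast for a few dimensionalities.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(200, dim), 3))
```

A small ratio means that the nearest and farthest neighbors are almost equally far away, which is the sense in which an Lk-norm stops discriminating between objects.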
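The three partition steps in point 2 can be sketched for a single numerical attribute as follows. The abstract does not give the exact merging rule, so this sketch makes an assumption: adjacent equal-width intervals are merged until each merged interval holds at least min_count values. The function name unequal_partition and the parameters n_bins and min_count are hypothetical.

```python
def unequal_partition(values, n_bins, min_count):
    """Return interval edges for one attribute.

    Step 1: build an equal-width histogram with n_bins intervals.
    Step 2: merge adjacent intervals until each holds >= min_count values
            (assumed merging rule, not specified in the abstract).
    Step 3: return the edges of the resulting unequal partition.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return [lo, hi]  # degenerate case: all values identical
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # hi goes in last bin
        counts[idx] += 1

    edges = [lo]
    running = 0
    for i, c in enumerate(counts):
        running += c
        # Close the current interval once it is populated enough.
        if running >= min_count and i < n_bins - 1:
            edges.append(lo + (i + 1) * width)
            running = 0
    # A sparse trailing interval is merged into its left neighbor.
    if running < min_count and len(edges) > 1:
        edges.pop()
    edges.append(hi)
    return edges


sample = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 5.0, 10.0]
print(unequal_partition(sample, n_bins=10, min_count=2))
```

In the sample, the dense region near zero keeps its own narrow interval while the sparse tail is collapsed into one wide interval, so the partition follows the data distribution rather than a fixed grid.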
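The graph construction and shortest-path step in point 4 can be sketched as below. This is an illustration under stated assumptions rather than the dissertation's implementation: the input is a symmetric distance matrix, each object is linked to its k nearest neighbors, and the helper names knn_graph and floyd_warshall are hypothetical. The example derives the distance matrix from 2D points only to keep it small; the abstract derives it from the proposed similarity function instead.

```python
import math


def knn_graph(dist, k):
    """Keep only edges between each object and its k nearest neighbors.

    Assumes dist is a symmetric matrix; missing edges are set to infinity.
    """
    n = len(dist)
    graph = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        others = (j for j in range(n) if j != i)
        for j in sorted(others, key=lambda j: dist[i][j])[:k]:
            graph[i][j] = dist[i][j]
            graph[j][i] = dist[i][j]  # keep the graph undirected
    return graph


def floyd_warshall(graph):
    """All-pairs shortest-path distances (Floyd's algorithm), O(n^3)."""
    n = len(graph)
    sp = [row[:] for row in graph]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if sp[i][m] + sp[m][j] < sp[i][j]:
                    sp[i][j] = sp[i][m] + sp[m][j]
    return sp


# Five points along an L-shaped path.
points = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
dist = [[math.dist(p, q) for q in points] for p in points]
geo = floyd_warshall(knn_graph(dist, k=1))
print(geo[0][4], math.dist(points[0], points[4]))
```

The shortest-path distance between the two ends of the path follows the bend, so it is longer than the straight-line distance. These path distances are the targets that the 2D Euclidean distances are meant to approximate during the genetic-algorithm optimization.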
Keywords/Search Tags: High Dimensional Data, Clustering, Similarity Measurement, Dimensionality Reduction