| With the widespread use of the Internet and the growing informatization of enterprises, people demand more effective information services. Text data in particular plays an important role on the network, and managing and analyzing it has become an important research subject. Analyzing and applying special kinds of text (unstructured and semi-structured text data) remains a difficult problem in text mining. Text clustering is one of the most important techniques in this area; its idea is to group similar texts into the same category. First, this thesis reviews research on text clustering at home and abroad. It also describes the technologies related to text clustering and analyzes and summarizes the advantages and disadvantages of commonly used clustering algorithms. As one of the most important text clustering algorithms, K-means is simple and therefore widely used in practice. K-means has achieved good results on text and image data, but it has shortcomings, such as the random selection of initial centers and the detection and treatment of isolated points. Many improved K-means algorithms have been proposed to deal with these problems. Choosing good initial centers is very important for giving K-means a good start. This thesis improves the K-means algorithm in the choice of the initial cluster centers: I propose a method based on density and distance for choosing them. In the improved algorithm, I first detect and handle isolated points that may exist in the sample data, and then choose the initial cluster centers by the improved method. The results show that the improved algorithm is more effective. Finally, the improved algorithm is applied to a clustering model based on web news.
First, the model extracts the body of the latest news crawled from the web, and I store it in the database as plain-text (txt) files. Then the original data is processed through Chinese word segmentation, feature extraction, construction of feature vectors, and so on. Finally, I cluster the feature vectors to obtain the cluster centers. I build a vector model of each article to calculate its similarity to the cluster centers, so that texts with similar content are grouped into the same class. I then analyze the clustering results. The results show that the improved algorithm achieves a higher clustering accuracy than the original algorithm. |
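
The abstract does not spell out how density and distance are combined, so the following is only a minimal sketch of one plausible reading, not the thesis's actual method. It assumes that the density of a point is the number of other points within a fixed `radius`, that points with fewer than `min_neighbors` neighbors are treated as isolated and excluded, that the first center is the densest remaining point, and that each further center is the candidate farthest from the centers already chosen. The function name and all parameters are hypothetical.

```python
import math


def choose_initial_centers(points, k, radius, min_neighbors=1):
    """Pick k initial K-means centers using density and distance.

    Assumed rules (illustrative, not taken from the thesis):
      * density(p) = number of other points within `radius` of p
      * points with density < min_neighbors are isolated and skipped
      * first center  = candidate with the highest density
      * later centers = candidate farthest from the chosen centers
    """
    n = len(points)
    # Full pairwise Euclidean distance matrix (fine for small samples).
    dist = [[math.dist(points[i], points[j]) for j in range(n)]
            for i in range(n)]

    # Density of each point: how many other points lie within `radius`.
    density = [
        sum(1 for j in range(n) if j != i and dist[i][j] <= radius)
        for i in range(n)
    ]

    # Drop isolated points so they can never become initial centers.
    candidates = [i for i in range(n) if density[i] >= min_neighbors]
    if len(candidates) < k:
        raise ValueError("not enough non-isolated points for k centers")

    # Start from the densest candidate.
    chosen = [max(candidates, key=lambda i: density[i])]

    # Repeatedly add the candidate whose nearest chosen center is farthest.
    while len(chosen) < k:
        remaining = [i for i in candidates if i not in chosen]
        nxt = max(remaining, key=lambda i: min(dist[i][c] for c in chosen))
        chosen.append(nxt)

    return [points[i] for i in chosen]


if __name__ == "__main__":
    # Two tight groups plus one far-away isolated point.
    sample = [(0, 0), (0, 1), (1, 0),
              (10, 10), (10, 11), (11, 10),
              (50, 50)]
    print(choose_initial_centers(sample, k=2, radius=2.0))
```

In this sketch the isolated point at `(50, 50)` has no neighbors within the radius, so it is filtered out before center selection and cannot pull an initial center away from the two real groups. The returned centers would then be passed to an ordinary K-means loop in place of randomly chosen ones.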