| In recent years, the rapid development of computer and information technology has greatly improved people's ability to acquire data. Data streams are an important type of data source and are attracting more and more attention. A data stream is a continuous, rapidly changing, ordered, and massive sequence of data. It is a fairly new kind of object that differs from traditional static data stored on disk. Data stream mining is currently a very active field, and data stream clustering is one of its most active research topics. One goal of this thesis is to design and develop a data stream clustering algorithm that is both accurate and fast. To this end, we have done the following work. We discuss the related research background and its significance. We summarize the advantages, disadvantages, and applicability of several popular types of clustering algorithms. We study the characteristics of data streams and the key technical issues in data stream clustering. On this basis, we propose TD-Stream, a data stream clustering algorithm based on density and grids. Borrowing the framework of the CluStream algorithm, TD-Stream is divided into an online layer and an offline layer, and the two layers work together to balance accuracy and speed. The online layer reads the data stream rapidly and stores the relevant information in a synopsis data structure. By introducing the "trend degree", we improve the way grid density is computed in traditional density-grid-based clustering algorithms: the data reading algorithm computes the trend degree of each new data point and uses it to map the point to the correct grid. This addresses two problems caused by absolute grid partitioning: a single grid belonging to more than one class, and the loss of information at grid edges. Using the synopsis data structure maintained by the online layer, the offline layer provides accurate clustering.
A density-based clustering algorithm is used, so the system can discover clusters of arbitrary shape. Using the concepts of grid frame and evolution difference, the system also supports clustering over the history of the data stream and analysis of its evolution. In this way, the algorithm exploits the high efficiency of grid-based methods while significantly improving clustering accuracy. Finally, we ran experiments with the proposed TD-Stream algorithm on both synthetic and real datasets; the results show that the algorithm is accurate and efficient and clusters data streams effectively. |
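
The two-layer, density-grid idea summarized above can be illustrated with a minimal Python sketch. It is not the TD-Stream implementation: the abstract does not define the trend degree, the grid frame, or the evolution difference, so none of them is modeled here. Instead, the sketch assumes a generic scheme in which the online layer keeps an exponentially decayed density per grid cell and the offline layer merges adjacent dense cells. The names `GridSynopsis`, `offline_cluster`, `cell_size`, `decay`, and `threshold` are hypothetical and chosen only for illustration.

```python
from collections import deque
from itertools import product


class GridSynopsis:
    """Online layer (illustrative): one decayed density value per grid cell."""

    def __init__(self, cell_size=1.0, decay=0.9):
        self.cell_size = cell_size  # hypothetical cell width
        self.decay = decay          # hypothetical fading factor per time step
        self.density = {}           # cell -> density at its last update
        self.last_update = {}       # cell -> time of its last update

    def cell_of(self, point):
        # Plain absolute partitioning; TD-Stream's trend-degree mapping
        # is not specified in the abstract and is NOT reproduced here.
        return tuple(int(x // self.cell_size) for x in point)

    def insert(self, point, t):
        cell = self.cell_of(point)
        elapsed = t - self.last_update.get(cell, t)
        faded = self.density.get(cell, 0.0) * self.decay ** elapsed
        self.density[cell] = faded + 1.0
        self.last_update[cell] = t
        return cell

    def dense_cells(self, t, threshold):
        # Fade every cell to time t before comparing with the threshold.
        return {c for c, d in self.density.items()
                if d * self.decay ** (t - self.last_update[c]) >= threshold}


def offline_cluster(dense):
    """Offline layer (illustrative): merge adjacent dense cells via BFS."""
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), []
        while queue:
            cell = queue.popleft()
            members.append(cell)
            # Visit all neighbours that differ by at most one step per axis.
            for offset in product((-1, 0, 1), repeat=len(cell)):
                nb = tuple(c + o for c, o in zip(cell, offset))
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(sorted(members))
    return clusters


stream = [(0.2, 0.3), (0.8, 0.1), (1.1, 0.4), (1.5, 0.9), (7.2, 7.7), (7.9, 7.1)]
synopsis = GridSynopsis(cell_size=1.0, decay=0.9)
for t, point in enumerate(stream):
    synopsis.insert(point, t)
clusters = offline_cluster(synopsis.dense_cells(t=len(stream), threshold=1.0))
print(clusters)
```

Because the online layer touches only one cell per arriving point, its cost per point is constant, while the more expensive neighbour search runs only when a clustering is requested. This separation is the general motivation for the online/offline split described in the abstract.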