| With the rapid growth of information on the Internet, search engines have become an indispensable tool for information retrieval, but they cannot meet users' personalized retrieval needs, so personalized search services based on user interests have become a focus of research and development. This thesis mainly describes the construction of a corpus for personalized information retrieval. It first reviews the development of self-built corpora and explains the importance of a self-built corpus for the study of information retrieval. It then analyzes the key technologies and principles of collecting information from the Internet, introducing the principles of automatic Web information acquisition and the Web spider. Clustering is the process of grouping data into classes, or clusters, so that objects in the same cluster are similar to one another and differ substantially from objects in other clusters. User clustering automatically groups users according to their interests, forming groups of users who share the same interests. The thesis introduces the traditional method, which clusters users through a user interest model, analyzes its limitations, and presents our method: clustering users directly from their click-through data, using their click records rather than a user interest model. Finally, it describes the data collection process for personalized search based on the Sogou corpus and the problems encountered in the experiments. The experimental results verify the correctness of the design approach and show good performance. The research and system development are based on data provided by a well-known company. In accordance with that company's system requirements, the thesis implements the user clustering model and forms groups of users with the same interests, laying the foundation for future work. |