| With the rapid development of the Internet, the amount of information on the network is growing explosively. The volume of data is enormous, and it grows and changes quickly, so it is difficult for users to find the information they need quickly and accurately. Search engine technology emerged to locate the data users need within this huge body of data while ignoring irrelevant information as far as possible. As a tool for information retrieval and an entry point to the World Wide Web, a search engine aims for the highest possible network coverage, but high coverage also means that it returns a great deal of useless information. In addition, the results that traditional search engines return for particular fields are not specialized enough and cannot meet the specific needs of those fields or of specific professional groups. To address these limitations of traditional search engines, this paper designs and implements a theme-based multithreaded web crawler system that crawls news and blog pages on the Internet. 
To build this system, this paper completes the following work. Second, in response to the system's need for text duplicate removal, it studies text duplicate removal techniques and proposes a new method, duplicate detection for Chinese texts based on semantic fingerprint and LCS. Next, it analyzes the requirements of the system and, based on them, produces the overall design of the system's framework, functions, and database. Finally, it designs the main modules in more detail, including their functions and processing flows, presents the key sections of code, and briefly shows the system's user interface. The theme-based multithreaded web crawler system researched and implemented in this paper supports multiple tasks and multiple threads, and lets users configure the system's parameters and themes. The system can provide real-time news and blogs about a particular field. Lastly, this paper designs experiments to test the system's data crawling rate and the accuracy of its theme judgment. The results show that the system achieves a high data crawling rate, judges themes with high accuracy and coverage, and removes duplicate texts effectively. |
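
The LCS step of the duplicate detection method named above can be illustrated with a minimal sketch. It assumes that LCS stands for longest common subsequence and that texts are compared character by character, which avoids Chinese word segmentation. It does not reproduce the semantic fingerprint stage or the paper's actual algorithm; the function names and the 0.8 threshold are hypothetical choices made for illustration only.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Classic dynamic programming, O(len(a) * len(b)) time, keeping only
    two rows of the table so memory stays O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the subsequence ending just before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop one character from either string, keep the better result.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """LCS length normalized by the longer text; 0.0 if either text is empty."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two texts as duplicates when their LCS similarity reaches the
    threshold. The default of 0.8 is an assumed value, not one from the paper."""
    return lcs_similarity(a, b) >= threshold
```

Because the comparison is quadratic in text length, a system like the one described would plausibly apply it only to candidate pairs that a cheaper fingerprint comparison has already flagged as similar.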