With the coming of the era of Internet big data, the booming network brings people a wealth of information resources. Faced with huge amounts of Internet information, how to access valuable information quickly and accurately has become a difficult problem. Information retrieval systems emerged accordingly, and search engines now provide a convenient way to access information; a great deal of research in this field has been done by many scholars. However, in the process of information collection, a general whole-network search engine ignores the subject and the processing order of the information, which yields broad, disordered, and uncorrelated results that require secondary processing before valuable information can be obtained.

To solve this problem, this paper studies correlation methods for information retrieval, proposes a method that can retrieve more in-depth information for a given field, and implements dynamic maintenance and optimization of the information indexes. The main work can be summarized in the following three aspects:

1) The web crawler Nutch, the distributed computing framework Hadoop, and the working procedure of MapReduce were studied; distributed crawling based on Nutch was realized, and the unstructured network information was stored as structured files (a MapReduce sketch follows this abstract).

2) Index building for information retrieval was achieved. The full-text indexing tool Lucene was studied, and an inverted index was constructed over the text crawled by Nutch, laying the foundation for further index processing (see the Lucene sketch below). An index pool model was proposed and constructed, and index pool maintenance and dynamic optimization were achieved by means of an index evaluation function, thus improving the quality of the index (see the evaluation sketch below).

3) A network information collection and search system was designed and developed, providing collections sorted by users' interests and a timed information push service.
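
The abstract does not reproduce the thesis's actual MapReduce job, so the following is only a minimal sketch of the MapReduce working procedure mentioned in point 1: a standard term-count job over crawled text, written against the Hadoop org.apache.hadoop.mapreduce API. The class name TermCount and the tokenization rule are illustrative assumptions, not details from the thesis.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TermCount {
        // Map phase: emit (term, 1) for every whitespace-separated token in a page.
        public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text term = new Text();
            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().toLowerCase().split("\\s+")) {
                    if (!token.isEmpty()) {
                        term.set(token);
                        context.write(term, ONE);
                    }
                }
            }
        }

        // Reduce phase: sum the counts collected for each term.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "term count");
            job.setJarByClass(TermCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class); // summing is associative, so reuse the reducer
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }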
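
As a minimal sketch of the inverted-index construction described in point 2, the following uses the standard Lucene IndexWriter API (assuming Lucene 5 or later); the field names url and content, the sample values, and the index path are assumptions for illustration, not details taken from the thesis.

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class PageIndexer {
        public static void main(String[] args) throws Exception {
            // Open (or create) an on-disk index; StandardAnalyzer tokenizes the text.
            FSDirectory dir = FSDirectory.open(Paths.get("index"));
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

            // One Document per crawled page: the URL is stored verbatim and left
            // unanalyzed, while the body text is analyzed into the inverted index.
            Document doc = new Document();
            doc.add(new StringField("url", "http://example.com/page", Field.Store.YES));
            doc.add(new TextField("content", "page text extracted by the crawler", Field.Store.NO));
            writer.addDocument(doc);

            writer.close();
        }
    }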
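
The abstract does not define the index evaluation function, so the following is a purely hypothetical illustration of the index pool idea: each entry carries usage statistics, is scored by hit frequency discounted by staleness, and the lowest-scoring entries are evicted during maintenance. Every class name, field, and the scoring formula here are assumptions, not the thesis's method.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class IndexPool {
        // Hypothetical record of one index entry's usage statistics.
        static class IndexEntry {
            final String term;
            final long hitCount;         // how often queries matched this entry
            final long lastAccessMillis; // last time a query touched it

            IndexEntry(String term, long hitCount, long lastAccessMillis) {
                this.term = term;
                this.hitCount = hitCount;
                this.lastAccessMillis = lastAccessMillis;
            }
        }

        // Hypothetical evaluation function: hit frequency discounted by age in days.
        // The actual function used in the thesis is not given in the abstract.
        static double evaluate(IndexEntry e, long nowMillis) {
            double ageDays = (nowMillis - e.lastAccessMillis) / 86_400_000.0;
            return e.hitCount / (1.0 + ageDays);
        }

        // Maintenance step: keep only the top 'capacity' entries by score.
        static List<IndexEntry> maintain(List<IndexEntry> pool, int capacity, long nowMillis) {
            List<IndexEntry> sorted = new ArrayList<>(pool);
            sorted.sort(Comparator.comparingDouble((IndexEntry e) -> evaluate(e, nowMillis)).reversed());
            return sorted.subList(0, Math.min(capacity, sorted.size()));
        }
    }

Any monotone function of such usage statistics would serve the same role; the point of the sketch is only that dynamic optimization reduces to periodically re-scoring and truncating the pool.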