Research And Implementation Of Academic Search Engine Based On Nutch

Posted on: 2012-03-12; Degree: Master; Type: Thesis
Country: China; Candidate: S Q Xia; Full Text: PDF
GTID: 2268330425991605; Subject: Computer application technology
Abstract/Summary:
In recent years, with the rapid development of computer technology, the internet has come to affect every aspect of people's work and life. Search engines came into being to help people make better use of internet resources. However, traditional general-purpose search engines have many problems, such as low coverage of the internet, poor precision, and erroneous navigation. As a result, vertical search engines that serve a particular domain have appeared. In the academic field, many vertical search engines have emerged to take full advantage of the large volume of academic resources that research institutions and researchers share over the internet. However, because of update lag, access controls on browsing and downloading, poor timeliness, and other reasons, no vertical search engine for the academic field is currently used as heavily as general search engines like Google.

For these reasons, this thesis studies and implements a new vertical search engine for the academic field based on Nutch. This academic search engine provides users with timely and highly relevant search results, and its user customized module lets users set the update frequency. The major work of this thesis includes the following aspects:

(1) To ensure broad information gathering, the crawling module of this academic search engine crawls the entire internet. This removes a limitation of some academic-oriented search engines, which gather information only from a limited set of academic websites. Under this condition, the academic search engine implements topic crawling. The topic crawling module introduces a web page topic relevance filtering mechanism that uses a similarity calculation method based on semantic gravity. This method determines the relevance of a web page to the topic by computing the similarity between the page and the topic-related words.
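The relevance filtering step in the crawler can be illustrated with a minimal sketch. The abstract does not spell out the semantic-gravity similarity, so this example substitutes plain cosine similarity over term counts as a stand-in; it is not the thesis's method. The function names, the `threshold` value, and the ASCII-only tokenizer are all assumptions made for illustration (Chinese pages would need a word segmenter first).

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its ASCII word tokens (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_topic_relevant(page_text, topic_words, threshold=0.1):
    """Keep a page only if it is similar enough to the topic words.

    threshold is a hypothetical cut-off, not a value from the thesis.
    """
    page = term_counts(page_text)
    topic = Counter(w.lower() for w in topic_words)
    return cosine_similarity(page, topic) >= threshold
```

A crawler built this way would call `is_topic_relevant` on each fetched page and follow outlinks only from pages that pass, which is one way a topic crawler can discover new relevant sites without a fixed site list.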
Determining the topic relevance of web pages at the crawling stage is currently the best way to build a vertical search engine. Implementing the topic crawling module over the whole internet not only ensures the topic relevance of the gathered web pages, but also enables the academic search engine to discover new topic-relevant websites.

(2) In the analysis module, this thesis implements a general template-based web page parser. This parser overcomes the dependence of templates on the structure of web pages and websites, so it supports semi-automated template creation and has some versatility. In addition, the template parser can be customized to collect only the web information that interests users, such as a web page's update time. Therefore, this parser can be customized and applied to different search engines.

(3) This thesis rebuilds Nutch's original Lucene index structure and adds a customized date field to the index files of the academic search engine. The retrieval module then implements a sorting method based on this date field, which adds the ability to retrieve results by web page timeliness. Furthermore, the retrieval module also implements a sorting method that considers the importance of both web page content and web links, overcoming the sorting shortcomings of some academic-oriented search engines.

(4) The user customized module implements management and configuration capabilities for the whole academic search engine. System operating parameters and the seed URLs can be configured visually here, and users can customize and filter seed URLs by keyword. In addition, this module provides a seed URL suggestion function, which allows users to recommend new seed URLs to the academic search engine. When configuration is finished, users can run the academic search engine directly from the user customized module.
This ease of use allows users to set the system's update frequency according to their actual needs and ensures the timeliness of the information provided by the academic search engine.

Through practical deployment and application, the academic search engine based on Nutch has achieved the desired goals. Analysis of its search results verifies that the main functions of each module have been implemented successfully. Users can get more relevant and more timely news and information about the academic areas that concern them using the academic search engine based on Nutch. Meanwhile, the academic search engine also has good scalability and versatility: functions can be added, removed, and improved conveniently, and this vertical search engine can be used in other domains after further modification.
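The two sorting modes described in item (3) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how content and link importance are combined, so the linear blend, the weight `alpha`, and the `Hit` record with its field names are all hypothetical, and the code works on plain Python objects rather than on a real Lucene index.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Hit:
    url: str
    content_score: float  # hypothetical text-relevance score
    link_score: float     # hypothetical link-importance score
    updated: date         # value taken from the customized date field


def sort_by_timeliness(hits):
    """Return hits ordered newest first, using the date field."""
    return sorted(hits, key=lambda h: h.updated, reverse=True)


def sort_by_combined_score(hits, alpha=0.7):
    """Return hits ordered by a weighted blend of content and link scores.

    alpha is an assumed weight on the content score; 1 - alpha goes to links.
    """
    def blended(h):
        return alpha * h.content_score + (1 - alpha) * h.link_score

    return sorted(hits, key=blended, reverse=True)
```

Keeping the two orderings as separate functions mirrors the abstract's description of timeliness sorting as an additional search option alongside importance-based ranking.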
Keywords/Search Tags: Nutch, search engine, Chinese word segmentation, URL filtering, sorting