| In recent years, the Internet and communication industries have flourished with the development of science and technology. As a result, the amount of network resources has grown explosively. People are used to solving problems with the help of the Internet when they need resources in certain fields. However, while this vast amount of information brings users a great deal of convenience, it also confuses them. It is becoming more and more difficult to locate targets rapidly among thousands of websites. To address this problem, this paper investigates technologies related to the classification display of search engine results. A category system is built to guide users and thereby reduce wasted time. Classification display in a search engine comprises two modules. The first module determines the categories of websites, while the second builds and retrieves classification indexes. The first step is to preprocess the websites, with the aim of converting each website into a vector. This paper proposes an algorithm based on web blocks, which extracts text content from a website according to the nodes in its DOM tree. The text content contains some words that are useless for distinguishing one website from another. In this paper, these words are filtered out based on document frequency, and a website is represented as a vector of the remaining words. The second part of the classification module trains the text classifiers. Because a support vector machine is inherently a binary classifier, this paper adopts a decision tree-based method to extend binary classifiers to multiple categories. The features needed to train the classifiers are selected by a multiple feature selection algorithm. To improve classification accuracy, the same feature is assigned different weights at different levels of the hierarchy. In the search engine module, the main functionality is implemented on top of Lucene, which is an open-source search engine architecture. 
Using the concept of a field in Lucene, classification indexes are built and the categories of websites are stored in them. When users want to browse websites of certain categories, the indexes are searched in the corresponding fields to provide results displayed by category. Finally, the proposed algorithms are verified in experiments. The results are evaluated with a series of metrics: classification accuracy and sample recall rate for the classification module, and time consumed and retrieval accuracy for the search engine module. The experiments demonstrate the feasibility of applying classification display in a search engine. |
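
The document-frequency filtering step described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not state the thresholds used, so `min_df` and `max_df_ratio`, the function names, and the toy token lists are all assumptions made for illustration. The idea shown is only that terms appearing in too few or too many documents carry little discriminative power and are dropped before each website is turned into a term-count vector.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so a term counts once per document
    return df


def select_vocabulary(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that are neither too rare nor too common.

    min_df and max_df_ratio are hypothetical thresholds, not values
    taken from the paper.
    """
    df = document_frequency(docs)
    n_docs = len(docs)
    return sorted(
        term
        for term, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )


def to_vector(tokens, vocabulary):
    """Represent one document as term counts over the kept vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Toy token lists standing in for text extracted from four websites.
docs = [
    ["the", "football", "match", "score"],
    ["the", "stock", "market", "score"],
    ["the", "football", "league", "table"],
    ["the", "stock", "price", "index"],
]

vocab = select_vocabulary(docs, min_df=2, max_df_ratio=0.75)
vectors = [to_vector(d, vocab) for d in docs]
```

In this toy example, "the" occurs in every document and is removed by the upper threshold, while words that occur in only one document are removed by the lower one. The resulting vectors could then serve as input to a classifier such as the support vector machines mentioned above.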