The Research And Realization Of Topical Search Engine Based On Page Segmentation

Posted on: 2010-12-07  Degree: Master  Type: Thesis
Country: China  Candidate: L Gao  Full Text: PDF
GTID: 2178360272479037  Subject: Computer application technology
Abstract/Summary:
A search engine is a system that collects and collates Web information resources and makes them available for querying. The rapid growth of Web information poses an unprecedented challenge to general-purpose search engines, and topical search engines are the development trend. However, current topical search engines all treat the whole page as the unit of processing, so they cannot effectively identify the content blocks related to a topic, which easily leads to topic drift. To address this, we apply page segmentation to focused crawling: when processing the content of a page, the unit is not the whole web page but a content piece called a block. Existing page segmentation methods do not perform well on pages that contain many topics. Our main contribution is therefore a new page segmentation method named CTVPS, which makes use of the visual information, tag information and link information in the web page. In addition, we propose a number of heuristic rules to control the accuracy and granularity of the blocks during segmentation. After page segmentation, we propose a new method for topic extraction: we apply a page classification model to content block classification and thereby implement block topic extraction. The results are satisfactory. The system, Search Smart, is implemented on the open-source search engine Nutch, which offers stability, extensibility and other advantages. Finally, the experiments indicate that the block-based topical search engine achieves higher retrieval quality than the page-based topical search engine.
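
The idea of judging relevance per block rather than per page can be illustrated with a minimal Python sketch. This is not CTVPS: it splits only on a hypothetical set of block-level tags (no visual or link cues, no heuristic rules), and it replaces the thesis's classification model with a simple keyword-overlap score. The names BLOCK_TAGS, segment, topic_score and on_topic_blocks, and the 0.2 threshold, are assumptions made for illustration.

```python
from html.parser import HTMLParser
import re

# Assumed set of block-level tags; CTVPS itself also uses visual and link cues.
BLOCK_TAGS = {"div", "p", "td", "li", "section", "article", "h1", "h2", "h3"}


class BlockSplitter(HTMLParser):
    """Start a new text block whenever a block-level tag opens or closes."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def _flush(self):
        # Join buffered text, collapse whitespace, and keep non-empty blocks.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()  # emit any trailing text after the last tag


def segment(html):
    """Return the page's text split into content blocks."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks


def topic_score(block, topic_terms):
    """Fraction of the block's tokens that are topic terms (a stand-in classifier)."""
    tokens = re.findall(r"[a-z]+", block.lower())
    if not tokens:
        return 0.0
    return sum(t in topic_terms for t in tokens) / len(tokens)


def on_topic_blocks(html, topic_terms, threshold=0.2):
    """Keep only the blocks whose score reaches the (assumed) threshold."""
    return [b for b in segment(html) if topic_score(b, topic_terms) >= threshold]
```

On a page that mixes an on-topic paragraph with an unrelated one, on_topic_blocks keeps only the former, whereas a whole-page score would be diluted by the unrelated text. That dilution is the topic-drift problem the abstract describes.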
Keywords/Search Tags: topical search engine, page segmentation, CTVPS, topic extraction, Nutch, Search Smart