| Focused crawlers selectively collect web pages from the Internet that are related to a user-specified topic, i.e. pages whose contents are of interest to users. Classic focused crawlers judge a document irrelevant to the topic when the term set of the document and the term set of the topic share no common terms, i.e. they set the relevance between the document and the topic to 0, whereas general semantic crawlers can still properly acquire that relevance. However, these semantic crawlers still have many problems: the topical similarity of anchor texts is local; the factors considered in the priorities of unvisited URLs are not comprehensive; the model for calculating the topical relevance of a document is flawed; and the weighted factors in the formula computing the priorities of unvisited URLs are determined arbitrarily. To address these problems, the main research work of this paper is as follows: (1) The Semantic Similarity Vector Space Model (SSVSM) is proposed to compute the similarity between a document and the given topic. SSVSM first constructs document and topic semantic vectors in the same semantic space, i.e. the two semantic vectors correspond to the same double term sets and have identical dimensions. The product of the two semantic vectors is then taken as the relevance between the document and the topic. (2) The Cell-Like Membrane Computing Optimization Algorithm (CMCOA) is applied to focused crawlers to optimize the weighted factors of the equation computing the priorities of unvisited URLs. In focused crawlers, CMCOA first takes the vector composed of all weighted factors as an object of each membrane. Second, it selects the optimal object by using the communication rules and evolution rules of each membrane, i.e.
the optimal object is the one for which the root-mean-square error between the training values and the estimates of the topical similarity of training URLs is minimal. Finally, the weighted factors corresponding to the optimal object are taken as the best weighted factors of the equation computing the priorities of unvisited URLs. (3) A focused crawler based on semantic understanding and intelligent learning is proposed. This crawling strategy treats the full texts of pages, anchor texts, titles of pages and link contexts as the four documents of hyperlinks, and combines the topical similarities of the four documents with the corresponding four weighted factors to obtain the priorities used to rank unvisited URLs. In addition, the topical similarities of the four documents are acquired by using SSVSM, and the corresponding four weighted factors are acquired by using CMCOA. |
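
The way the four topical similarities and the four weighted factors fit together can be illustrated with a minimal sketch. It assumes that the combination is a plain weighted sum, which the abstract does not state explicitly, and it replaces the CMCOA search with a simple pick among a few hand-written candidate weight vectors. The function names, the candidate vectors and the training data are hypothetical and serve only as illustration; the similarities themselves are taken as given inputs rather than computed by SSVSM.

```python
import math

def priority(sims, weights):
    """Combine the four topical similarities (full text, anchor text,
    title, link context) with their weighted factors.
    Assumption: the combination is a weighted sum."""
    return sum(w * s for w, s in zip(weights, sims))

def rmse(weights, training):
    """Root-mean-square error between the training values and the
    estimates produced by `priority` for the training URLs.
    `training` is a list of (sims, target) pairs."""
    squared = [(target - priority(sims, weights)) ** 2
               for sims, target in training]
    return math.sqrt(sum(squared) / len(squared))

def best_weights(candidates, training):
    """Stand-in for the CMCOA search (not membrane computing):
    return the candidate weight vector with the smallest RMSE."""
    return min(candidates, key=lambda w: rmse(w, training))

# Hypothetical training URLs: four similarities and a target value.
training = [
    ([0.9, 0.8, 0.7, 0.6], 0.80),
    ([0.2, 0.1, 0.0, 0.3], 0.15),
]
# Hypothetical candidate weight vectors.
candidates = [
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
chosen = best_weights(candidates, training)
print(chosen, rmse(chosen, training))
```

In the approach described above, the candidate vectors would instead be the objects evolved and exchanged between membranes, with the same RMSE criterion deciding which object is optimal.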