The Research Of Semantic Focused Crawlers Based On Membrane Computing Optimization Algorithm

Posted on:2014-05-26

Degree:Master

Type:Thesis

Country:China

Candidate:W J Liu

GTID:2268330401481640

Subject:Computer software and theory

Abstract/Summary:

Focused crawlers selectively collect web pages from the Internet that are related to a topic given by the user, i.e. pages whose contents are of interest to the user. A classic focused crawler decides that a document is irrelevant to the topic whenever the term set of the document and the term set of the topic share no common terms, i.e. the relevance between the document and the topic is 0, whereas general semantic crawlers can still properly estimate the relevance in that case. However, semantic crawlers still have many problems: the topical similarity of anchor texts is only local; the factors considered when prioritizing unvisited URLs are not comprehensive; the model for calculating the topical relevance of a document is flawed; and the weighted factors in the formula that computes the priorities of unvisited URLs are determined arbitrarily.

To address the above problems, the main research work of this thesis is as follows:

(1) The Semantic Similarity Vector Space Model (SSVSM) is proposed to compute the similarity between a document and the given topic. SSVSM first constructs a document semantic vector and a topic semantic vector. The two semantic vectors lie in the same semantic space, i.e. they correspond to the same double term sets and have identical dimensions. The product of the two semantic vectors is then taken as the relevance between the document and the topic.

(2) The Cell-Like Membrane Computing Optimization Algorithm (CMCOA) is applied to focused crawlers to optimize the weighted factors of the equation that computes the priorities of unvisited URLs. CMCOA first takes the vector composed of all weighted factors as an object of each membrane. Second, it selects the optimal object by using the communication rules and evolution rules of each membrane, i.e. the object for which the root mean square error between the training values and the estimates of the topical similarity of the training URLs is minimal. Finally, the weighted factors corresponding to the optimal object are taken as the best weighted factors of the equation that computes the priorities of unvisited URLs.

(3) A focused crawler based on semantic understanding and intelligent learning is proposed. This crawling strategy treats the full texts of pages, anchor texts, titles of pages and link contexts as the four documents of a hyperlink, and combines the topical similarities of the four documents with the corresponding four weighted factors to rank the priorities of unvisited URLs. The topical similarities of the four documents are obtained by using SSVSM, and the corresponding four weighted factors are obtained by using CMCOA.
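
The priority computation and the error criterion described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the constraint that the weights sum to 1, and the toy numbers are assumptions made for illustration, and a plain random search stands in for CMCOA, whose membrane rules the abstract does not specify. The sketch shows the product of two semantic vectors as relevance, the weighted combination of the four topical similarities as the priority of an unvisited URL, and the root mean square error used to compare candidate weight vectors.

```python
import math
import random

# The four "documents" of a hyperlink named in the abstract.
FIELDS = ("full_text", "anchor_text", "title", "link_context")


def dot(u, v):
    """Product of two semantic vectors that share the same dimensions."""
    if len(u) != len(v):
        raise ValueError("vectors must lie in the same semantic space")
    return sum(a * b for a, b in zip(u, v))


def priority(sims, weights):
    """Weighted combination of the four topical similarities of one URL."""
    return sum(weights[f] * sims[f] for f in FIELDS)


def rmse(weights, training):
    """Root mean square error between training values and estimates."""
    squared = [(target - priority(sims, weights)) ** 2
               for sims, target in training]
    return math.sqrt(sum(squared) / len(squared))


def random_search(training, iterations=2000, seed=0):
    """Stand-in optimizer (NOT CMCOA): keep the candidate with lowest RMSE."""
    rng = random.Random(seed)
    best = {f: 1.0 / len(FIELDS) for f in FIELDS}  # start from equal weights
    best_err = rmse(best, training)
    for _ in range(iterations):
        raw = [rng.random() for _ in FIELDS]
        total = sum(raw)
        # Assumption: weights are non-negative and normalized to sum to 1.
        cand = {f: r / total for f, r in zip(FIELDS, raw)}
        err = rmse(cand, training)
        if err < best_err:
            best, best_err = cand, err
    return best, best_err


# Toy training URLs (made-up values): four similarities plus a target score.
training = [
    ({"full_text": 0.8, "anchor_text": 0.6, "title": 0.7, "link_context": 0.5}, 0.75),
    ({"full_text": 0.2, "anchor_text": 0.4, "title": 0.1, "link_context": 0.3}, 0.20),
    ({"full_text": 0.5, "anchor_text": 0.9, "title": 0.6, "link_context": 0.4}, 0.55),
]

weights, error = random_search(training)
print(weights, error)
```

In a crawler, `priority` would be evaluated for every unvisited URL with the fitted weights, and the URL with the highest value would be fetched next.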

Keywords/Search Tags:

Focused Crawler, Semantic Similarity, Membrane Computing, Vector Space Model, Optimization Algorithm

Related items

1	Research On Search Strategy And Key Techniques Of Focused Crawler
2	Research On Focused Crawler Based On SVM Classification Algorithm
3	Research And Implementation Of Focused Crawler
4	Research And Implement Of Distributed Focused Crawler
5	Designing Focused Crawler Based On Improved Genetic Algorithm
6	Research And Implement Of Focused-crawler Relevance Algorithm In Search Engine
7	Research On The Topic Crawler Algorithm Based On Vector Space Model
8	Design And Implementation Of Multi Information Web System Of Automotive Industry Based On Focused Crawler
9	Research And Implementation Of On Semi-automatic Ontology Construction Base On WordNet And Focused Crawler
10	Research On Configuration Space Evolutionary Algorithm For Facility Layout And Focused Crawler