Design And Implementation Of Crawler System For Food Contact Material Safety

Posted on: 2018-11-12
Degree: Master
Type: Thesis
Country: China
Candidate: W G Guan
GTID: 2348330536952511
Subject: Software engineering
Abstract/Summary:
In recent years, safety incidents caused by food contact materials containing hazardous substances have occurred repeatedly and have touched a sensitive nerve with the public. Collecting relevant information from the web with topic crawler technology is therefore important for monitoring the safety of food contact materials. Based on an actual project, the classification management system for food contact materials and products, this thesis focuses on the design of a crawler system oriented to specific topics. A review of the literature on topic crawler technologies reveals two major problems in traditional topic crawler research: 1) the selection of initial seeds is still understudied; 2) the precision and recall of crawling strategies still need to be improved. This thesis presents a new solution to these problems, designs and implements the key modules of the system on that basis, and verifies the effectiveness of the proposed techniques through experiments and the running results of the system. The main contributions of this thesis are as follows:

(1) An initial seed selection algorithm based on the HITS algorithm is proposed. HITS is used to calculate the authority and hub (centrality) values of web pages, quality metrics for candidate seeds are defined, and high-quality links are selected as seeds. The original HITS algorithm, however, is prone to the "topic drift" problem. This thesis improves the expansion process of the base web page set in the algorithm by eliminating invalid links and evaluating the topic value of links. As a result, the links added during expansion are of good quality, and the values calculated from them are more reliable. The final results of the system show that the proposed algorithm is effective.

(2) A topic crawling strategy based on a concept context graph with a comprehensive link value is proposed. First, formal concept analysis is used to extract concepts from the topic background, and a concept lattice is constructed from those concepts. Next, the concept lattice is transformed into a concept context graph based on the semantic similarity between concepts, and the graph is used to store the user's query intention. This thesis also improves the virtual formal concept matching algorithm so that the topic similarity of a page can be calculated faster and more accurately. A link topic value prediction formula that combines the parent page, anchor text, link context and URL is then defined to determine the access priority of links. The experimental results show that the proposed method outperforms the traditional topic crawler based on the concept context graph, with significantly improved efficiency and accuracy.

(3) The crawler system is designed and implemented in Java on top of WebMagic. The key modules are designed, including the initial seed selection module, the concept context graph construction module and the crawling module, and the database storage scheme is designed in detail. For concept lattice construction, the Lattice Miner tool is used to process the formal context (formal background) and produce the Hasse diagram of the lattice. Finally, the crawling effect is verified; the results show that the proposed strategy effectively improves the efficiency and accuracy of the topic crawler, and the system has been successfully applied to the actual project.
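To make contribution (1) concrete, the following is a minimal sketch of the basic HITS iteration followed by a seed ranking step. It is not the thesis's improved algorithm: the topic-aware expansion of the base page set is not reproduced, the thesis's own system is written in Java, and the names `hits`, `select_seeds` and the linear `weight` mix of authority and hub scores are assumptions made only for illustration.

```python
import math


def hits(graph, iterations=50, tol=1e-9):
    """Basic HITS on a directed graph given as {page: [pages it links to]}."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub update: sum the new authority scores of all pages linked to.
        new_hub = {
            n: sum(new_auth[dst] for dst in graph.get(n, [])) for n in nodes
        }
        # L2-normalise both vectors so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        new_auth = {n: v / a_norm for n, v in new_auth.items()}
        new_hub = {n: v / h_norm for n, v in new_hub.items()}

        delta = sum(
            abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n]) for n in nodes
        )
        auth, hub = new_auth, new_hub
        if delta < tol:
            break
    return auth, hub


def select_seeds(graph, k, weight=0.5):
    """Rank pages by a hypothetical quality metric and return the top k.

    The linear mix of authority and hub scores is an assumption; the
    abstract does not state how the thesis defines seed quality.
    """
    auth, hub = hits(graph)
    quality = {n: weight * auth[n] + (1 - weight) * hub[n] for n in auth}
    # Sort by descending quality, breaking ties by page name for determinism.
    return sorted(quality, key=lambda n: (-quality[n], n))[:k]
```

In a setting like the one the abstract describes, `graph` would be the link graph of the expanded base page set, after invalid and off-topic links have been filtered out, and the returned pages would serve as the crawler's initial seeds.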
Keywords/Search Tags: topic crawler, concept context graph, seed selection, crawling strategy
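The crawling strategy in contribution (2) ranks unvisited links by a topic value that combines the parent page, anchor text, link context and URL. The sketch below shows one plausible shape for such a formula together with a priority-ordered frontier. The weights in `WEIGHTS`, the field names of `link`, and the use of plain cosine similarity over term counts are all assumptions; the thesis instead scores topic similarity with its concept context graph and an improved virtual formal concept matching algorithm, which are not reproduced here.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(text, topic_terms):
    """Cosine similarity between a text and a list of topic terms.

    A simple stand-in for the thesis's concept-based topic similarity.
    """
    a = Counter(tokenize(text))
    b = Counter(topic_terms)
    dot = sum(a[t] * b[t] for t in b)
    if dot == 0:
        return 0.0
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


# Hypothetical weights for the four evidence sources; they sum to 1.
WEIGHTS = {"parent": 0.3, "anchor": 0.3, "context": 0.2, "url": 0.2}


def link_value(link, topic_terms, weights=WEIGHTS):
    """Predict the topic value of a link as a weighted sum of four signals."""
    return (
        weights["parent"] * link["parent_score"]
        + weights["anchor"] * cosine(link["anchor"], topic_terms)
        + weights["context"] * cosine(link["context"], topic_terms)
        + weights["url"] * cosine(link["url"], topic_terms)
    )


class Frontier:
    """Crawl frontier that always returns the highest-valued unseen URL."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, value):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated value.
            heapq.heappush(self._heap, (-value, url))

    def pop(self):
        neg_value, url = heapq.heappop(self._heap)
        return url, -neg_value

    def __len__(self):
        return len(self._heap)
```

Here `parent_score` stands for the topic similarity already computed for the page on which the link was found, so a link inherits part of its priority from a relevant parent even when its own anchor text is uninformative.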