Nowadays, the Internet holds the largest volume of information in the world. When constructing domain search engines and domain knowledge bases, or conducting research on text analysis, we need to obtain massive amounts of text data related to specific domains or topics from the Internet. At present, automated methods for large-scale Internet information acquisition face three main difficulties. First, conventional search engines and crawlers find domain-related information only by keyword matching; a simple combination of one or more keywords cannot adequately express domain information or fully capture domain concepts, which results in low accuracy. Second, webpages contain a great deal of unrelated content, such as navigation bars and advertisement links, which lowers data quality and complicates webpage content extraction. Third, webpage texts carry no semantic labels, yet subsequent applications such as semantic retrieval and information recommendation all rely on semantic labels, so webpage text cannot directly and effectively support these applications. To solve these problems, this thesis proposes a framework for domain-oriented webpage content extraction and semantic label generation. It effectively identifies webpages related to the target domain through a link topic relevance prediction algorithm, extracts the main content of webpages based on a text object model, and finally generates semantic labels for each text from the statistical and semantic features of the webpage content. The main research work of this thesis is as follows:

1. A framework for domain-oriented webpage content extraction and semantic label generation is proposed. This thesis analyzes and summarizes the difficulties of massive domain information acquisition on the Internet, and proposes a framework for domain-oriented webpage content
extraction and semantic label generation. The framework comprises a webpage collection layer, a data extraction layer, and a semantic processing layer, and can effectively identify topic-related webpages, extract webpage content, and generate semantic labels.

2. A domain ontology-based prediction algorithm for link topic relevance is proposed. To address the low accuracy of massive information acquisition, this thesis proposes a prediction algorithm for link topic relevance based on a domain ontology. With a domain ontology describing the target topic, the algorithm computes topic relevance from the link URL, link text, and link context, which effectively improves accuracy.

3. A webpage content extraction method based on a text object model is proposed. To handle the large amount of irrelevant content in webpages, this thesis first compresses the text object model of a webpage and then identifies the main content through the density of text links. Finally, for clustered noisy links, a noise link recognition method based on node entropy is proposed to detect such links effectively.

4. A method for generating semantic labels based on statistical and semantic features is proposed. First, a semantic disambiguation method based on WordNet and Doc2Vec determines the sense of ambiguous words in the text. Then, semantic label weights are calculated from statistical and semantic characteristics together with domain factors, and semantic labels are generated from these weights. Finally, the webpage content texts are clustered by their semantic labels to better support data applications.

5. A platform for domain-oriented massive information acquisition is constructed. Based on the proposed framework, a prototype system for domain-oriented massive information acquisition is designed and implemented. The practicability of the framework is verified by demonstrating the platform's functions and comparing it with other platforms.
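The ontology-based link relevance prediction described above can be illustrated with a minimal sketch. Here a small weighted term dictionary stands in for the full domain ontology, and the combination weights for URL, link text, and context are illustrative assumptions, not values from the thesis:

```python
import re

# Hypothetical mini "ontology": a weighted set of domain concept terms.
# The thesis uses a full domain ontology; this dictionary is a stand-in.
DOMAIN_CONCEPTS = {"ontology": 1.0, "semantic": 0.8, "knowledge": 0.6}

def _score(text, concepts):
    """Weighted overlap between the tokens of `text` and the concept terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(concepts.get(t, 0.0) for t in tokens) / len(tokens)

def link_topic_relevance(url, anchor_text, context,
                         concepts=DOMAIN_CONCEPTS,
                         w_url=0.2, w_anchor=0.5, w_ctx=0.3):
    """Combine the three evidence sources into one relevance score.
    The weights are illustrative assumptions, not thesis values."""
    return (w_url * _score(url, concepts)
            + w_anchor * _score(anchor_text, concepts)
            + w_ctx * _score(context, concepts))
```

A crawler would follow only links whose combined score exceeds some threshold, so domain-relevant pages are fetched preferentially.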
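The text-link density and node entropy ideas behind the content extraction method can be sketched as follows. The 0.5 threshold and the use of children's text lengths as the entropy distribution are illustrative assumptions rather than the thesis's exact definitions:

```python
import math

def link_density(text_len, link_text_len):
    """Share of a node's characters that sit inside hyperlinks."""
    return 0.0 if text_len == 0 else link_text_len / text_len

def is_main_content(text_len, link_text_len, threshold=0.5):
    """Treat nodes dominated by link text (navigation bars, ad link
    lists) as noise; the 0.5 threshold is an illustrative choice."""
    return text_len > 0 and link_density(text_len, link_text_len) < threshold

def node_entropy(child_text_lengths):
    """Shannon entropy of the text-length distribution over a node's
    children -- an illustrative stand-in for the thesis's node entropy.
    Clusters of uniformly short links yield characteristic values."""
    total = sum(child_text_lengths)
    if total == 0:
        return 0.0
    h = 0.0
    for n in child_text_lengths:
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h
```

A long paragraph containing one short link passes the density test, while a navigation list whose text is almost entirely links does not.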
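The semantic label weighting can likewise be sketched as a blend of a TF-IDF-style statistical score with a semantic similarity score; the blending parameter `alpha` and the `domain_boost` factor are hypothetical names introduced here for illustration:

```python
import math

def label_weight(tf, df, n_docs, semantic_sim, domain_boost=1.0, alpha=0.6):
    """Blend a TF-IDF statistical score with a semantic similarity
    score in [0, 1]. `alpha` and `domain_boost` are hypothetical
    parameters standing in for the thesis's weighting scheme."""
    tfidf = tf * math.log((n_docs + 1) / (df + 1))
    return domain_boost * (alpha * tfidf + (1 - alpha) * semantic_sim)

def top_labels(weights, k=3):
    """Return the k candidate terms with the highest label weight."""
    return sorted(weights, key=weights.get, reverse=True)[:k]
```

In this scheme a term that is frequent in the document, rare in the corpus, and semantically close to the domain receives the highest weight and is emitted as a label.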