Design And Implementation Of Expert Homepage Information Extraction System

Posted on: 2020-04-07
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhang
GTID: 2427330626450730
Subject: Software engineering
Abstract/Summary:
Industry-University-Research cooperation is an important part of improving the high-tech innovation ability of China's small and medium-sized enterprises. However, it faces difficulties in talent introduction. The main causes are the disconnection between the government and academic circles and the information asymmetry between scientific research institutions and enterprises. Expert homepage information on the Internet can help users learn about experts and support talent introduction work. However, expert homepages have problems of their own, such as being scattered across many sites and describing experts in unclear text. It is therefore necessary to integrate all relevant homepage resources and extract the useful information from them, giving users a more convenient and accurate way to review expert information.

To achieve these objectives, this thesis designs and implements an expert homepage information extraction system based on Web information extraction technology. The system is essentially a sub-module of the expert information platform, and it completes the framework of the expert portrait in that platform. Here, the expert portrait is defined as a visualization page describing the expert's general overview, research direction, and so on, and it combines the information extracted in this thesis. The main work of this thesis is as follows:

(1) Given the list of experts supplied by the platform, the system automatically identifies each expert's homepage site from web query results. It then combines HTML structure with Chinese and English grammar to locate, filter, and normalize the web page text, completing data collection.

(2) Data preprocessing includes constructing a corpus, annotating the data set, and selecting a feature vector. The system implements an automatic labeling scheme based on the results of text parsing and rule matching. Taking into account the semantics of the field text and the structure of its context, Word2Vec, TF-IDF, POS, NER, and other indicators are introduced to complete the feature vector selection.

(3) For homepage information extraction, this thesis proposes a scheme that identifies candidate fields by part of speech and outputs field labels through SVM and GBDT classification models. To improve the overall performance of information extraction, weighted voting fusion over multiple groups of models is implemented, with the groups obtained by varying model parameters and training-set sampling.

(4) The extracted information must be integrated before it can be filled into the corresponding expert portrait. An information integration algorithm is proposed that considers the position of the field in the original context, the output label, and other parameters. The thesis specifies the constituent elements and placement of each type of information in the expert portrait, and the visual display is realized through a unified page design and structured data.
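The weighted voting fusion mentioned in (3) can be illustrated with a minimal sketch. The abstract does not give the model weights, the label names, or a tie-breaking rule, so the function name `weighted_vote`, the labels, and the weights below are all assumptions made for illustration, not the thesis's implementation.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight across models.

    predictions: one predicted field label per model (hypothetical labels).
    weights: one non-negative weight per model (hypothetical values).
    """
    if not predictions or len(predictions) != len(weights):
        raise ValueError("need at least one model and one weight per prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # Iterate labels in sorted order so ties resolve deterministically:
    # max() keeps the first of several equally scored labels.
    return max(sorted(totals), key=lambda label: totals[label])


# Three hypothetical models (say, two SVM variants and one GBDT)
# each label the same candidate field.
fused = weighted_vote(["research_direction", "title", "title"], [0.5, 0.2, 0.2])
```

Here the single model with weight 0.5 outvotes the two models with a combined weight of 0.4. In practice the weights might come from each model's validation accuracy, but that choice is also an assumption.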
Keywords/Search Tags: expert portrait, web information extraction, text analysis, model fusion