| Attribute extraction plays a vital role in knowledge graph construction; it has both research significance and broad application prospects. Its main task is to extract the attribute values of entities from unstructured text. Product attribute identification is an important step in constructing knowledge graphs for e-commerce. However, the wide variety of products and the large number of product attribute categories on e-commerce platforms raise several problems for product attribute recognition. First, when faced with large-scale data with many attribute categories and complex entities, attribute extraction models lack basic robustness to interference. Second, labeled data is scarce. Although distant (remote) supervision can quickly generate large amounts of labeled data, such data inevitably contains missing and incorrect labels. To address these issues, the main contents of this thesis are as follows: (1) Research on e-commerce attribute extraction based on machine reading comprehension. In e-commerce, deep learning models often face large-scale product attribute extraction tasks. Because the number of attribute types is large, the accuracy of the decoding classification layer of a traditional sequence labeling model drops sharply. To alleviate this problem, this thesis constructs an e-commerce attribute extraction corpus by distant supervision, with product titles as the main data source. Based on this corpus, the thesis recasts attribute extraction as a machine reading comprehension task: each attribute is converted into a question, and the attribute value is the answer extracted from the text. The model incorporates prior information about the attributes during text encoding and obtains
the attribute values of different attributes by posing a fixed question for each attribute. In addition, this thesis explores question construction methods and decoder choices in depth to find the best question-decoder combination. Extensive experiments on the e-commerce dataset show that the model improves both the accuracy and the recall of attribute extraction. (2) Research on unknown-label-sensitive attribute extraction combined with a self-training framework in distant supervision scenarios. This thesis focuses on attribute extraction under distant supervision. Distant supervision is based on the idea of taking attribute values from a dictionary and automatically labeling the text fragments that contain those values with the corresponding attribute type. It generates labeled data quickly and saves manual effort, but it also suffers from varying dictionary size and introduces labeling errors. This thesis proposes a model, based on a self-training framework, that is sensitive to unknown labels. During self-training, the model does not learn from unknown labels, the label set is updated continuously across iterations, and uncertainty and entity length are additionally used as pseudo-label filters. Experiments show that the model gradually improves the quality of the dataset, alleviates the missing-label and wrong-label problems of distant supervision, and effectively improves the model’s ability to identify out-of-vocabulary (unregistered) words. (3) Construction of an attribute extraction system for the e-commerce field. Based on large-scale e-commerce texts collected from various sources, this thesis builds an e-commerce corpus and combines the preceding research results to design and
build an attribute extraction system for the e-commerce field based on a client/server architecture. In addition, the functional modules of the system have been tested and debugged. The system has started to go online and supports diversified user queries. In summary, this thesis focuses on the difficulties of product attribute extraction in the e-commerce field, proposes methods to improve current attribute extraction systems, constructs a dataset to verify the effectiveness of these methods, and builds a complete product attribute extraction system. |
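
The dictionary-based labeling idea behind distant supervision, as described in the abstract, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `distant_label`, the toy dictionary, the character-level BIO tagging scheme, and the longest-match-first rule are all assumptions made for illustration.

```python
from typing import Dict, List


def distant_label(text: str, dictionary: Dict[str, str]) -> List[str]:
    """Tag each character of `text` with a BIO label by dictionary matching.

    `dictionary` maps an attribute value (e.g. "red") to its attribute
    type (e.g. "Color"). Both the mapping and the BIO scheme are
    illustrative assumptions, not details taken from the thesis.
    """
    labels = ["O"] * len(text)
    # Try longer values first so a short value cannot split a longer match.
    for value in sorted(dictionary, key=len, reverse=True):
        if not value:
            continue  # an empty value would match everywhere; skip it
        attr = dictionary[value]
        start = text.find(value)
        while start != -1:
            end = start + len(value)
            # Only label spans that are still completely unlabeled.
            if all(tag == "O" for tag in labels[start:end]):
                labels[start] = f"B-{attr}"
                for i in range(start + 1, end):
                    labels[i] = f"I-{attr}"
            start = text.find(value, start + 1)
    return labels


if __name__ == "__main__":
    toy_dictionary = {"red": "Color", "cotton": "Material"}  # hypothetical entries
    title = "red cotton shirt"
    for ch, tag in zip(title, distant_label(title, toy_dictionary)):
        print(repr(ch), tag)
```

In this toy run, "shirt" stays tagged `O` only because the dictionary has no entry for it. Such a tag may be a missing label, not a true negative, which is the kind of noise the unknown-label-sensitive self-training approach in part (2) is meant to reduce.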