| The history of Inner Mongolia is a rich part of the history of the Chinese nation and carries a long cultural heritage. Existing historical records, however, are dense running text, and non-specialists find it difficult to sort out and understand the semantic relationships they contain. For example, if one chapter records that Tolui is the son of Genghis Khan and the next records that Kublai is the son of Tolui, a reader must go through both chapters to learn that Kublai is Genghis Khan's grandson. RDF (Resource Description Framework) describes objects and relations in the real world as sets of triples, which meets the need to describe relationships in historical knowledge with a simple binary relational model. The example above can be expressed in RDF as <Genghis Khan, Son, Tolui> and <Tolui, Son, Kublai>, from which <Genghis Khan, Grandson, Kublai> follows directly. As a metadata language, RDF's triple representation carries semantic information, is not tied to a specific syntax, supports multiple serialization formats, and makes relationships easy to reason about. This thesis therefore focuses on RDF triple extraction and designs deep learning triple extraction models for the domain of Inner Mongolian historical knowledge, extracting information into a structured triple representation. The specific work is as follows: (1) To avoid the error propagation of pipelined triple extraction methods, this thesis designs a joint triple extraction model based on the machine reading comprehension (MRC) framework, using multi-round question answering. A forward and reverse questioning strategy, an entity-relationship prediction module, and a strategy of smoothly changing question weights are proposed to address, within the MRC framework, the problems of rigid semantic learning, the complexity of predicting over all relation types, and error propagation in the data. Based on ablation and comparative experiments evaluated at two granularities, character level and entity level, the integrated model EMRC achieves F1 scores of 85.19% and 68.5%, and 87.58% and 71.07%, respectively, on the entity recognition and relation classification tasks. (2) To verify that the MRC-based triple extraction model can effectively extract triples of the "relation overlap" type, this thesis constructs an "entity-relationship" label system for the domain of Inner Mongolian historical knowledge, builds a domain triple extraction dataset based on that label system, and adapts the dataset's prediction labels to different task requirements. In the pre-training stage, a pre-trained language model, "BERT+", that integrates domain knowledge features is trained starting from the Chinese BERT released by Google. The multi-task learning framework SMoE is designed and improved to implement a multi-task learning model that combines the triple extraction task with the relation overlap type prediction task. The F1 scores for entity recognition, relation classification, and relation overlap prediction reach 87.7%, 72.7%, and 66.4%, respectively. |
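
The grandson inference used as the motivating example in the abstract can be illustrated with a minimal sketch. It assumes triples are plain Python tuples of the form (subject, predicate, object) rather than a real RDF store, and the function name `infer_grandsons` and the predicate labels `"Son"` and `"Grandson"` are illustrative choices, not part of the thesis's model. Tolui is written in its common romanization.

```python
from typing import Dict, Set, Tuple

# A triple is (subject, predicate, object), mirroring <Genghis Khan, Son, Tolui>.
Triple = Tuple[str, str, str]


def infer_grandsons(triples: Set[Triple]) -> Set[Triple]:
    """Derive (a, "Grandson", c) whenever (a, "Son", b) and (b, "Son", c) exist."""
    # Index the "Son" relation: parent -> set of sons.
    sons: Dict[str, Set[str]] = {}
    for subj, pred, obj in triples:
        if pred == "Son":
            sons.setdefault(subj, set()).add(obj)

    # Chain two "Son" edges to obtain a "Grandson" edge.
    inferred: Set[Triple] = set()
    for grandparent, children in sons.items():
        for child in children:
            for grandchild in sons.get(child, set()):
                inferred.add((grandparent, "Grandson", grandchild))
    return inferred


# The two facts that the abstract places in separate chapters.
triples = {
    ("Genghis Khan", "Son", "Tolui"),
    ("Tolui", "Son", "Kublai"),
}

print(infer_grandsons(triples))
# prints {('Genghis Khan', 'Grandson', 'Kublai')}
```

Once the two facts are available as triples, the relationship that required reading two chapters reduces to a single join over the `Son` relation, which is the practical argument for extracting structured triples from the running text.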