| Cross-border ethnic groups are groups that, through historical development, share a common ethnic origin but live in different countries; they retain largely the same traditional cultural customs, such as literature and art, festivals, and ceremonies. As exchanges among cross-border ethnic groups have grown, a large amount of data on cross-border ethnic culture has appeared on the Internet. Using deep learning to extract domain entities and the relations between entity pairs from these texts is of great significance for promoting research on cross-border ethnic culture. Applying current mainstream entity recognition and entity relation extraction methods to this field still faces two problems. First, annotated datasets for entity and entity relation extraction in the field of cross-border ethnic culture are lacking. Second, cross-border ethnic cultural texts contain many domain-specific terms and compound words, which lead to fuzzy entity boundaries, overlapping entity relations, and inaccurate segmentation of domain words. To solve these problems, this thesis completes the following work: (1) Construction of a corpus for cross-border ethnic cultural entity and entity relation extraction: Based on an analysis of data on the cross-border ethnic groups of Yunnan Province, the Dai and Yi ethnic groups within China and the Tai, Shan, Lao, and Lo-Lo ethnic groups overseas were selected as the main cross-border ethnic groups. We describe in detail the characteristics of entities and relations in the field of cross-border ethnic culture, and construct a 5,000-entry domain vocabulary, 15,000 entity recognition samples, and 18,000 entity relation extraction samples to provide domain analysis and data support for subsequent research. (2) A cross-border ethnic cultural entity recognition method with word set attention: Because of the large number of entities with fuzzy boundaries and entities composed
of multiple words in cross-border ethnic cultural texts, current mainstream entity recognition methods struggle with fuzzy domain entity boundaries and make entity recognition errors. This thesis proposes a cross-border ethnic cultural entity recognition method that integrates word set information to alleviate the fuzzy boundary problem. Domain word vectors are trained on the cross-border ethnic cultural domain dictionary constructed in (1), word set information is obtained by word set matching, and a word set attention mechanism assigns weights to the word set information, which is then integrated into the text representation produced by a pretrained language model. The features are encoded by a Bi-GRU and a self-attention mechanism, and the encoded features are used to train the entity recognition model. Experiments show that the proposed method reaches an F1 score of 94.71%. (3) A cross-border ethnic cultural entity relation extraction method that integrates a domain dictionary: In the absence of domain information representation, existing entity relation extraction models are poor at tagging the underlying domain entities, which leads to many incorrectly extracted entity relations. In addition, entity pairs are densely distributed in cross-border ethnic cultural texts, which produces many overlapping entity relation triples. To solve these problems, this thesis proposes a cross-border ethnic cultural entity relation extraction method based on multilayer pointer annotation. Domain dictionary information is added to the text, a convolutional neural network extracts domain features from the input text, and these feature representations are fused into the character feature representations to enrich the contextual domain information. The model uses a Bi-LSTM to extract contextual
semantic information. Finally, a pointer network marks the head entities and the tail entities under each relation condition. Experimental results show that the proposed method reaches an F1 score of 82.50%. (4) A prototype system for cross-border ethnic cultural entity and entity relation extraction: Based on the above methods, a prototype system was designed with the Django framework, and its environment configuration, model construction, model training process, and system functions are described. The models were exposed as service interfaces through the Sanic framework and integrated into the prototype system for cross-border ethnic cultural entity and entity relation extraction. The system consists of a cross-border ethnic cultural entity recognition module and a cross-border ethnic cultural entity relation extraction module. |
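
The word set matching and word set attention steps in (2) can be illustrated with a minimal sketch. It is not the thesis implementation: the toy lexicon, the 2-dimensional word vectors, the dot-product scoring with softmax, and the function names `match_word_sets` and `attend` are all assumptions made for illustration, since the abstract does not specify the matching rules or the form of the attention.

```python
import math


def match_word_sets(text, lexicon, max_len=4):
    """For each character position, collect every lexicon word that covers it."""
    word_sets = [[] for _ in text]
    for start in range(len(text)):
        # Try every span of up to max_len characters starting at `start`.
        for end in range(start + 1, min(start + max_len, len(text)) + 1):
            word = text[start:end]
            if word in lexicon:
                for pos in range(start, end):
                    word_sets[pos].append(word)
    return word_sets


def attend(char_vec, words, word_vecs):
    """Softmax-weighted sum of word vectors, scored by dot product with char_vec.

    Dot-product scoring is an assumption; the thesis may use another scorer.
    """
    dim = len(char_vec)
    if not words:
        return [0.0] * dim  # no matched words: contribute nothing
    scores = [sum(c * w for c, w in zip(char_vec, word_vecs[word]))
              for word in words]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(wt * word_vecs[word][i] for wt, word in zip(weights, words))
            for i in range(dim)]


# Toy data (hypothetical): a lexicon and matching word vectors.
lexicon = {"ab", "bcd", "abcd"}
word_vecs = {"ab": [1.0, 0.0], "bcd": [0.0, 1.0], "abcd": [1.0, 1.0]}

word_sets = match_word_sets("abcd", lexicon)
# Fuse the word set of the first character with a placeholder character vector.
fused = attend([0.0, 0.0], word_sets[0], word_vecs)
print(word_sets[0], fused)
```

In a full model, the fused vector would presumably be combined with the character representation from the pretrained language model before the Bi-GRU encoder; how that combination is done is not stated in the abstract.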