Research And Implementation Of Plant Relationship Extraction In Tibetan Plateau Based On Distant Supervision

Posted on: 2022-11-06
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Cui
GTID: 2480306764456874
Subject: Computer Software and Application of Computer
Abstract/Summary:
The vast alpine meadows, alpine grasslands and desertified alpine grasslands of Tibet are rich in grasses and sedges. This complex ecological environment supports a wide diversity of plant species and lays the foundation for forestry, animal husbandry, agriculture and other industries, which occupy a prominent position in Tibet's economic construction. With the development of deep learning, the data analysis and learning capabilities of computers can now serve many different fields. To make use of Tibet's rich plant resources, this research applies relation extraction based on distant supervision to clean, transform and process Internet text about Tibetan Plateau plants, and extracts knowledge triples about these plants that can be used directly.

However, the way distant supervision annotates data introduces problems that need to be solved. First, automatically annotated datasets inevitably contain mislabeled examples, because there is no guarantee that every sentence mentioning two entities actually expresses the relationship between them; this produces a large amount of noise. Second, data obtained through distant supervision from external knowledge bases or knowledge graphs often follows a power-law distribution, which makes it difficult to obtain data for some medium- and long-tail entity pairs. Both the noise problem and the long-tail problem greatly limit the performance of relation extraction based on distant supervision. The research therefore improves distant supervision relation extraction in the following two ways.

(1) Because annotated corpora are lacking for plants of the Tibetan Plateau, a method for constructing a distantly supervised dataset is proposed to alleviate the long-tail problem and reduce the noise that incorrect labeling introduces into the training data. The innovation of the method lies in using the similarity between sentence features and their keywords to determine the relationship between entities as reliably as possible. First, an open-domain knowledge base is aligned with the Tibetan Plateau Species Index and triples are extracted to build a Tibetan Plateau plant domain knowledge base. Second, guided by the knowledge base, sentences containing the corresponding entities are crawled from websites such as Baidu Encyclopedia to form an unlabeled corpus. Then, relation feature words are established and, to address the noise problem of distant supervision relation extraction, an automatic corpus labeling algorithm based on dependency parsing and sentence similarity is designed to reduce noisy data and generate a labeled corpus. Finally, this automatic labeling method is compared with several other methods. The experiments show that it substantially improves accuracy and significantly reduces the long-tail data and noise effects brought by distant supervised learning.

(2) A relation extraction model, MPCNN, is proposed. MPCNN divides the relation extraction task into six modules and uses a convolutional neural network, a multi-head self-attention mechanism and sentence feature selection to select high-quality samples, improving the recognition rate of medium- and long-tail entity relations and alleviating the impact of distant supervision noise. The innovation of the MPCNN model is that, after the convolution module extracts features, the multi-head self-attention mechanism allocates feature weights to the different words in a sentence more effectively, increasing the weight of correct sentences and reducing the interference of noisy sentences. In experiments comparing it with CNN+ATT and other models, the model trained with this framework shows improved accuracy and stability, which verifies the feasibility of the proposed relation extraction model in the field of Tibetan Plateau plants.

Finally, on the basis of the above research, this paper stores the triple data in the Neo4j graph database, designs and implements a knowledge graph display system for the Tibetan Plateau plant field based on ASP.NET front-end web page technology, and thereby realizes an application of the distant supervision relation extraction algorithm.
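To illustrate the labeling idea in (1), the following is a minimal sketch and not the thesis's algorithm. It labels a sentence with a knowledge-base relation only when both entities appear in it and it also shares at least one relation feature word. Simple keyword overlap stands in here for the dependency parsing and sentence similarity that the abstract describes, and the triples, sentences, relation names and feature words are all hypothetical toy data.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


# Hypothetical knowledge-base triples (toy data, not from the thesis).
TRIPLES = [
    Triple("Kobresia pygmaea", "belongs_to_family", "Cyperaceae"),
    Triple("Kobresia pygmaea", "distributed_in", "Tibet"),
]

# Hypothetical relation feature words, one set per relation.
FEATURE_WORDS = {
    "belongs_to_family": {"family", "belongs", "member"},
    "distributed_in": {"grows", "distributed", "found", "native"},
}

# Hypothetical crawled sentences; the last one mentions both entities
# of the second triple but does not express the relation (noise).
SENTENCES = [
    "Kobresia pygmaea is a member of the family Cyperaceae.",
    "Kobresia pygmaea grows widely in alpine meadows of Tibet.",
    "A photo of Kobresia pygmaea was taken in Tibet last year.",
]


def tokenize(sentence: str) -> set[str]:
    """Lowercase whitespace tokens with surrounding punctuation removed."""
    return {w.strip(".,;:()").lower() for w in sentence.split()} - {""}


def label_sentences(triples, sentences, min_hits=1):
    """Distant-supervision labeling with a keyword-overlap noise filter.

    Plain distant supervision would label every sentence that contains
    both entities of a triple. Here a sentence is kept only if it also
    contains at least `min_hits` feature words of the triple's relation.
    """
    labeled = []
    for sentence in sentences:
        lowered = sentence.lower()
        tokens = tokenize(sentence)
        for triple in triples:
            has_entities = (
                triple.head.lower() in lowered and triple.tail.lower() in lowered
            )
            if not has_entities:
                continue
            hits = len(tokens & FEATURE_WORDS.get(triple.relation, set()))
            if hits >= min_hits:
                labeled.append((sentence, triple, hits))
    return labeled


if __name__ == "__main__":
    for sentence, triple, hits in label_sentences(TRIPLES, SENTENCES):
        print(f"{triple.relation} ({hits} feature words): {sentence}")
```

In this toy run the third sentence would be labeled "distributed_in" by plain distant supervision, since it mentions both entities, but the feature-word filter rejects it. A real pipeline would replace the overlap count with a parser-based and similarity-based score.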
Keywords/Search Tags:Relation Extraction, Automatic Annotation, Distant Supervision, Multi-head Self-attention, Knowledge Graph