| Document-level relation extraction extends sentence-level relation extraction and must identify complex interactions among multiple entities across a document. Its main challenge is fusing features from multiple sentences, so filtering high-quality, task-related sentences out of a document can improve the accuracy of relation extraction. This paper takes document-level genetic risk relation extraction from medical genetics literature as its core task and proposes an efficient method for filtering sentences in documents. Medical genetics literature has three characteristics: entities are sparsely distributed, a large number of sentences are unrelated to genetic risk relationships, and the articles share a similar structure. Filtering sentences before relation extraction is therefore particularly important. In addition, relation extraction training sets built by distant supervision contain many inaccurate pseudo-labels, which act as noise and degrade model training. Building on existing research, a distant-supervised sentence selection method based on reinforcement learning is proposed, which makes it easier for researchers to extract genetic risk relationships from medical genetics literature. The main contributions are as follows: (1) A document graph construction method based on genetic risk relationships is proposed. The method makes full use of the semantic and structural information in medical genetics literature. Sentence nodes are constructed from gene, mutation, and disease entities, and structural nodes are constructed from the structure of the article. Co-occurrence edges, sequential edges, and structural edges are then constructed according to node characteristics. The document graph addresses two problems in document-level relation extraction: entity information is widely scattered, and structural information is not fully exploited. (2) A distant-supervised sentence selection method based on reinforcement learning, DGRL, is proposed. Taking the document graph as the input of the sentence selector, the reinforcement learning agent walks randomly in the document graph and marks sentences as positive or negative. The sentence selector is trained jointly with the relation classifier and continuously optimizes its selections through reward feedback. Positive sentences are screened out of the distant-supervised dataset through this process, which addresses the problem of too many negative sentences in that dataset. It also further improves the accuracy of sentence selection and the performance of relation extraction. (3) A visualization system for genetic risk information is designed and implemented. Given a PMCID as input, the system retrieves the corresponding medical genetics article and shows its entity annotations. The document graph and the positive sentences of the article can also be displayed dynamically on screen, which allows users to obtain genetic risk information more intuitively. The system also provides a download function for genetic risk sentences. In summary, a distant-supervised sentence selection method based on reinforcement learning is proposed. First, a document graph based on genetic risk relationships is constructed, which integrates the structural information in documents. Then, with the document graph as input, the model selects positive sentences from medical genetics literature through reinforcement learning. Finally, a visualization system for genetic risk information is designed and implemented. |
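The document graph described in contribution (1) can be illustrated with a minimal sketch. It is not the authors' implementation: the `Sentence` type, the `build_document_graph` function, the rule that only entity-bearing sentences become nodes, and the rule that a co-occurrence edge links two sentences sharing an entity are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Sentence:
    """Hypothetical container for one sentence of an article."""
    text: str
    section: str  # assumed section label, e.g. "Abstract" or "Results"
    # Assumed entity format: (entity_type, surface_form) pairs.
    entities: set = field(default_factory=set)


def build_document_graph(sentences):
    """Build (nodes, edges) for one article.

    Assumptions (not stated in the abstract):
      * a sentence becomes a node only if it mentions an entity;
      * each section containing such a sentence becomes a structural node;
      * co-occurrence edges link sentences that share an entity.
    """
    nodes = {}
    edges = set()

    # Sentence nodes: keep sentences that mention a gene, mutation, or disease.
    kept = [i for i, s in enumerate(sentences) if s.entities]
    for i in kept:
        nodes[("sent", i)] = sentences[i]

    # Structural nodes and structural edges (section -> sentence).
    for i in kept:
        sec = ("sec", sentences[i].section)
        nodes.setdefault(sec, sentences[i].section)
        edges.add((sec, ("sent", i), "structural"))

    # Sequential edges between consecutive kept sentences.
    for a, b in zip(kept, kept[1:]):
        edges.add((("sent", a), ("sent", b), "sequential"))

    # Co-occurrence edges between sentences sharing at least one entity.
    for a, b in combinations(kept, 2):
        if sentences[a].entities & sentences[b].entities:
            edges.add((("sent", a), ("sent", b), "co-occurrence"))

    return nodes, edges
```

In this sketch a sentence with no gene, mutation, or disease mention never enters the graph, which mirrors the filtering motivation given above; a real system would presumably take its entities from an upstream annotation step rather than from hand-built sets.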