Research On Core Entity Extraction For News Based On Document Structure Analysis

Posted on: 2024-05-08
Degree: Master
Type: Thesis
Country: China
Candidate: H L Guo
GTID: 2568307157977659
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet, the amount of text on the Web is growing exponentially, leading to increasingly serious "information overload". The sheer volume of miscellaneous information raises the cost of finding useful information. The core entity of a Web news article is the entity that the article mainly describes or that plays the main role in it; identifying it helps readers quickly grasp the main thrust of the news and filter for high-quality information. However, news covers a wide range of fields, varies widely in type and length, and has no obvious distinguishing features, and the boundaries of core entity words are fuzzy, all of which makes it difficult to extract core entities from Web news. To improve the performance of core entity extraction, it is necessary to combine the semantic information and features of the context to analyze the dependencies between the document structure and the semantic blocks of an article, thereby capturing long-distance dependencies and grasping the central thrust of the news. This thesis focuses on the analysis and construction of the document structure of news text and on its impact on the core entity extraction task. The main work of the thesis is as follows:

(1) The Tree-LSTM model is widely used in natural language processing tasks because it naturally captures long-distance semantic relationships. The thesis therefore proposes a BERT-Tree-LSTM-CRF architecture, based on the Tree-LSTM method, to extract core entities. BERT sentence vectors exhibit high anisotropy and a concentrated distribution of high-frequency words, which can cause the resulting sentence vectors to deviate from the actual semantics. To address this issue, the architecture constructs a semantic parse tree based on the syntactic dependency structure and the textual hierarchy to capture long-distance semantic and structural information within news articles, and finally derives sentence and article vectors with rich semantic information.

(2) The traditional Tree-LSTM model needs an external parser to analyze the structure of a sentence, and it has difficulty constructing dependencies between the sentences or paragraphs of an article. To address this problem, the thesis proposes a task-oriented document analysis model and core entity extraction method. The method uses a cascaded Gumbel-Tree-LSTM model to analyze the document structure of news in an end-to-end manner, understand the central content of news articles, and extract core entities. The experimental results show that the cascaded Gumbel-Tree-LSTM model effectively improves the efficiency of core entity extraction.

(3) To tackle the problem of incorrect paragraph boundaries when parsing the structure of a text, the thesis adopts a multi-task learning method to constrain the paragraph boundaries. The method learns the semantic boundaries of paragraphs by introducing an auxiliary text segmentation task, which improves the correctness of some paragraph structures in the text parse tree and, in turn, the model's ability to understand the central content and the efficiency of core entity extraction. The experimental results demonstrate the effectiveness of the core entity extraction method based on this multi-task learning strategy.
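Item (2) describes building the document structure end to end instead of relying on an external parser. The abstract does not give the model's details, so the following is only a minimal sketch of the general idea behind Gumbel-Tree-LSTM-style models: adjacent nodes are merged bottom-up, and the pair to merge is chosen by perturbing learned scores with Gumbel noise (the Gumbel-max trick). The names `compose`, `score`, `build_tree`, and the `query` vector are hypothetical, `compose` is a plain averaging stand-in for a Tree-LSTM cell, and the gradient estimator needed for training is omitted.

```python
import math
import random


def gumbel_noise(rng):
    # Sample Gumbel(0, 1) noise by inverse transform: -log(-log(U)).
    u = max(rng.random(), 1e-12)  # clamp to avoid log(0)
    return -math.log(-math.log(u))


def compose(left, right):
    # Stand-in for a Tree-LSTM composition cell (assumption):
    # the element-wise mean of the two child vectors.
    return [(a + b) / 2.0 for a, b in zip(left, right)]


def score(vec, query):
    # Hypothetical merge score: dot product with a query vector
    # (in a trained model the query would be a learned parameter).
    return sum(v * q for v, q in zip(vec, query))


def build_tree(leaves, query, rng):
    """Merge adjacent nodes bottom-up until a single root remains.

    Returns (root_vector, structure), where structure is a leaf index
    or a nested pair of structures.
    """
    nodes = [(list(vec), i) for i, vec in enumerate(leaves)]
    while len(nodes) > 1:
        # One candidate parent for every adjacent pair of nodes.
        candidates = [
            compose(nodes[i][0], nodes[i + 1][0])
            for i in range(len(nodes) - 1)
        ]
        # Gumbel-max trick: argmax of (score + Gumbel noise) samples a
        # pair with probability proportional to exp(score).
        perturbed = [score(c, query) + gumbel_noise(rng) for c in candidates]
        best = max(range(len(perturbed)), key=perturbed.__getitem__)
        merged = (candidates[best], (nodes[best][1], nodes[best + 1][1]))
        nodes = nodes[:best] + [merged] + nodes[best + 2:]
    return nodes[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 2-dimensional "sentence vectors" for a four-sentence article.
    sentences = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    root_vector, structure = build_tree(sentences, [1.0, 1.0], rng)
    print(structure)
    print(root_vector)
```

Applied to word vectors, the same loop would yield a sentence tree; applied to sentence or paragraph vectors, a document tree. That is one plausible reading of "cascaded", not a confirmed description of the thesis's design.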
Keywords/Search Tags:Core entity recognition, Document structure, Text segmentation, Multi-task learning