
Research On Text Similarity And Text Association Analysis Based On Multi-granularity Information

Posted on: 2023-11-04
Degree: Master
Type: Thesis
Country: China
Candidate: J W Shi
Full Text: PDF
GTID: 2558307070983319
Subject: Signal and Information Processing
Abstract/Summary:
Text similarity and text association analysis are two important branches of natural language processing (NLP) and play core supporting roles in many practical applications. With the successful application of deep learning in NLP, deep neural network models have been applied to text similarity and text association analysis to reduce the cost of manual feature engineering. On the one hand, most previous models for text similarity analysis were designed for English text. The common approach to applying them to Chinese is either to take Chinese characters directly as input or to segment each sentence into words with an existing Chinese word segmentation (CWS) system. However, CWS systems suffer from segmentation errors and ambiguous semantic segmentation, which makes effective feature construction and semantic understanding difficult. On the other hand, in text association analysis, the associated clauses must be labeled in the dataset in advance, which limits the application of such models in real-world scenarios. Based on these two points, this thesis studies two directions: designing effective Chinese text representations to overcome the limitations of CWS systems, and optimizing text association analysis algorithms to remove the dependency on dataset annotation. The innovations and contributions of this work are summarized as follows:

(1) This thesis proposes a Chinese text representation method based on multi-granularity information (HyperLexicon), which can extract the complete vocabulary information in the text. In addition, three fusion methods are designed to integrate the multi-granularity information of the text for model training. Based on the HyperLexicon, a character-word two-stream network (CL2N) is designed; the network extracts single-sentence features and interactive features to improve the performance of text similarity analysis.

(2) This thesis studies the emotion cause analysis task, the main task of text association analysis. The task is divided into two sub-tasks: 1) extraction of associated elements (emotion clauses and cause clauses); 2) combination and filtering of the associated elements. For the first sub-task, a mutual-assistance single-task model (MASTM) based on multi-granularity information is proposed to extract the associated elements. For the second sub-task, the Cartesian product is used to combine the extracted elements, the relative position information of the associated clauses is added to assist filtering, and three filters of different granularity are designed to select the correct groups of associated clauses.

Furthermore, this study conducts comparative experiments on several public datasets against frontier models for both text similarity analysis and text association analysis, and analyzes the experimental results from multiple dimensions to verify the effectiveness of the proposed methods. The results show that CL2N outperforms existing short text matching models on the text similarity analysis task. The method alleviates the error propagation caused by Chinese word segmentation, and the combination of single-sentence features and interactive features allows the network to capture contextual semantic information as well as the vocabulary information that both sentences attend to, which helps the model obtain its best results.
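The character-word fusion idea underlying this kind of multi-granularity representation can be illustrated with a minimal sketch. The toy lexicon, embedding sizes, and mean-pooling fusion below are illustrative assumptions only; they do not reproduce the thesis's actual HyperLexicon construction, its three fusion methods, or the CL2N architecture.

```python
# Minimal sketch: fusing character- and word-granularity features for one sentence.
# The lexicon, dimensions, and pooling are hypothetical choices for illustration.
import torch
import torch.nn as nn

LEXICON = {"深度", "深度学习", "学习"}   # toy word lexicon
chars = list("深度学习很有趣")           # character-granularity tokens

def matched_words(chars, lexicon, max_len=4):
    """For each character position, collect the lexicon words that cover it."""
    spans = [[] for _ in chars]
    for i in range(len(chars)):
        for j in range(i + 1, min(i + max_len, len(chars)) + 1):
            word = "".join(chars[i:j])
            if word in lexicon:
                for k in range(i, j):
                    spans[k].append(word)
    return spans

char_vocab = {c: i for i, c in enumerate(sorted(set(chars)))}
word_vocab = {w: i for i, w in enumerate(sorted(LEXICON))}
char_emb = nn.Embedding(len(char_vocab), 32)   # character-granularity embeddings
word_emb = nn.Embedding(len(word_vocab), 32)   # word-granularity embeddings

spans = matched_words(chars, LEXICON)
fused = []
for ch, words in zip(chars, spans):
    c_vec = char_emb(torch.tensor([char_vocab[ch]])).squeeze(0)
    if words:
        w_ids = torch.tensor([word_vocab[w] for w in words])
        w_vec = word_emb(w_ids).mean(dim=0)    # pool all words matched at this position
    else:
        w_vec = torch.zeros(32)                # no lexicon word covers this character
    fused.append(torch.cat([c_vec, w_vec]))    # concatenate the two granularities
sentence_repr = torch.stack(fused)             # shape: (num_chars, 64)
print(sentence_repr.shape)
```

In such a scheme the character stream supplies context that segmentation errors cannot corrupt, while the matched lexicon words contribute word-level semantics; a matching network such as CL2N would consume representations of this kind for both sentences.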
The results also show that, on the text association analysis task, the proposed method can simultaneously extract the related emotion clause and emotion cause clause, removes the need for dataset pre-annotation, and achieves better accuracy in recognition and extraction. Compared with the multi-task learning model, the F1 score of the MASTM is improved by 5.3%.
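The combination-and-filtering sub-task described above can be sketched as follows. The clause texts, the relative-position feature, and the scoring rule are illustrative stand-ins; the thesis's three trained filters of different granularity are not reproduced here.

```python
# Minimal sketch: pair candidate emotion and cause clauses via a Cartesian product,
# attach a relative-position feature, and keep plausible pairs with a stand-in filter.
from itertools import product

# candidate clauses produced by the extraction sub-task: (clause index, text)
emotion_clauses = [(3, "I was so angry")]
cause_clauses = [(1, "the train was delayed again"), (6, "the weather was lovely")]

def filter_score(emotion_text, cause_text, rel_pos):
    """Stand-in for the learned pair filters: favour clauses that are close together."""
    return 1.0 / (1.0 + abs(rel_pos))

candidate_pairs = []
for (e_idx, e_text), (c_idx, c_text) in product(emotion_clauses, cause_clauses):
    rel_pos = c_idx - e_idx                    # relative position of the cause clause
    score = filter_score(e_text, c_text, rel_pos)
    candidate_pairs.append(((e_text, c_text), rel_pos, score))

# keep pairs whose score exceeds a threshold as the final emotion-cause groups
kept = [p for p in candidate_pairs if p[2] >= 0.3]
for (e_text, c_text), rel_pos, score in kept:
    print(f"emotion={e_text!r}  cause={c_text!r}  rel_pos={rel_pos:+d}  score={score:.2f}")
```

With these toy inputs only the nearby "train was delayed" clause survives the threshold, which mirrors the role of relative position information in filtering out implausible emotion-cause combinations.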
Keywords/Search Tags: Text Similarity Analysis, Text Association Analysis, Multi-granularity Information, Word Segmentation, Dependency of Dataset Annotation