Research On Key Technologies Of Sentence Text Matching And Recognition

Posted on: 2023-05-05
Degree: Master
Type: Thesis
Country: China
Candidate: Q Q Li
GTID: 2558307169481164
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of modern information technology, information on the Internet, especially news, has grown explosively. Anyone browsing online can see that the Internet is flooded with identical or near-identical news items, which makes it harder to find useful information quickly. An entity trend analysis system for large-scale text streams processes and analyzes such texts, mitigates this problem, and returns target-related information ordered by time. The system consists of data collection, data screening, text deduplication, and information screening modules, of which text deduplication and information screening are the key links. Because labeling large-scale data is difficult, only a small amount of labeled data is available. Current text deduplication methods struggle to recognize sentences that have the same meaning but different wording, so we optimize the key technology of unsupervised text matching and deduplicate using the sentence vector representations produced by the text matching model. Named entity recognition (NER) is a basic task of information extraction, and NER models perform poorly when trained on very little labeled data. We therefore also optimize small-sample NER.

First, for text matching: because text matching relies on computing the similarity between sentence vector representations, improving the accuracy of those representations is the primary task. Previous work suggests that contrastive learning reinforces the difference between positive and negative samples, which improves the accuracy of sentence vector representations. We therefore propose Pro-SimCSE, which combines contrastive learning with a clustering algorithm, and evaluate it on the STS datasets. Experiments show that, under unsupervised training based on BERT-base, the Spearman correlation coefficient increases by 1.22% compared with SimCSE.

Second, for small-sample NER: to address the limited performance of models trained on a small amount of labeled data, we propose a small-sample NER method. One solution for small-sample NER is active learning. Over multiple training rounds, active learning uses a query strategy to select the data to be labeled, so that the NER model improves more quickly under a limited labeling budget. However, active learning depends on multi-round sampling, and the early rounds usually involve few labeled samples, so the model improves slowly. In addition, a given query strategy does not generalize well: it must be fixed in advance and cannot be adjusted during training, so the chosen strategy may not suit every dataset. To solve these problems, we propose a framework that performs data augmentation during active learning. To validate it, we focus on the Chinese NER task and carry out extensive experiments on two public datasets. The results show that the framework is effective for a series of classical query strategies. On the Resume dataset, it reaches 99% of the performance of the best deep learning model trained on the full data while using only 22% of the data, a 63% reduction in labeled data compared with pure active learning (PAL).

Finally, this thesis applies the key techniques described above to open-source news in an entity trend analysis system for large-scale text streams. The unsupervised text matching algorithm Pro-SimCSE is used for text deduplication, and the framework of data augmentation during active learning is used for information screening, yielding a trend list of key entities. Experiments demonstrate the effectiveness of the proposed key technologies.
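The deduplication step mentioned in the abstract, which compares sentence vector representations, can be illustrated with a minimal sketch. This is not the thesis's implementation: the greedy keep-first policy, the function names `cosine_similarity` and `deduplicate`, the threshold of 0.9, and the toy 3-dimensional vectors are all assumptions made for illustration. In practice the vectors would come from a sentence encoder such as the Pro-SimCSE model.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def deduplicate(vectors: Sequence[Sequence[float]], threshold: float = 0.9) -> List[int]:
    """Greedy deduplication (an assumed policy, not taken from the thesis).

    A sentence is kept only if its similarity to every previously kept
    sentence is below `threshold`. Returns the indices of kept sentences.
    """
    kept: List[int] = []
    for i, vec in enumerate(vectors):
        if all(cosine_similarity(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy stand-ins for sentence embeddings (hypothetical values).
toy_vectors = [
    [1.0, 0.0, 0.0],   # sentence A
    [0.99, 0.05, 0.0], # near-duplicate of A
    [0.0, 1.0, 0.0],   # unrelated sentence
]
kept_indices = deduplicate(toy_vectors, threshold=0.9)
print(kept_indices)
```

The pairwise comparison is quadratic in the number of kept sentences, so a system handling large-scale text streams would likely need an approximate nearest-neighbor index instead; the sketch only shows the similarity-threshold idea.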
Keywords/Search Tags: Natural Language Processing, Text Match, NER, Active Learning, Contrastive Learning