
Research On Multi-task Text Analysis Based On BERT

Posted on: 2022-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: G Y Song
Full Text: PDF
GTID: 2517306311465474
Subject: Statistics

Abstract/Summary:
In today's digital, networked and globalized environment, social media connects people and redefines how text is created and distributed. As a result, both the volume and the complexity of text show an "exponential growth" trend. It is therefore particularly important to manage this massive, unstructured text data intelligently, which requires Internet practitioners and researchers to use deep learning models that automatically identify the underlying topics of text, so that information in different languages can be understood quickly and accurately. Text classification is not only a basic function of text mining but also one of the core technologies for processing and modeling natural language text, and thus has high research value. Compared with traditional publications, online texts raise two difficulties: on the one hand, users often write with non-standard expressions such as colloquialisms and slang, which makes it hard to capture the characteristics of the text; on the other hand, traditional text classification algorithms place high demands on training samples, while Chinese corpus construction started late and still lags English corpora in scale, quality and topical coverage. Based on the multilingual pre-trained model XLM-RoBERTa, this thesis designs a model suited to Chinese sentence-level text classification. Its main contributions are as follows:

1. To address the shortage of Chinese training data, this thesis uses an efficient data-migration method that, under the supervision of the BLEU score, maps the English training set and its textual characteristics over in order to expand the number of training samples. Both Chinese and English text are fed into the deep learning model, and the distributional differences between the two languages are used to increase the diversity of the training samples, yielding a cross-lingual data set for the text classification task.

2. To handle the particularities of this data set, this thesis combines random sampling with keyword extraction to propose a preprocessing algorithm: some of the words in each group of sentences are masked, allowing unsupervised further pre-training without hurting monolingual performance (a minimal sketch of this masking step follows the contribution list). After this further pre-training, the multilingual model XLM-R is fully exploited to extract embedding vectors, and the resulting text feature vectors are passed to a downstream text classifier based on a graph neural network.

3. Exploiting the connection between sentence classification and other NLP tasks, this thesis designs a model optimization method based on multi-task learning: named entity recognition and keyword extraction serve as auxiliary tasks for text classification, and the learners for the individual tasks are constructed and combined into an integrated model (a sketch of this shared-encoder design is given at the end of the abstract). Because BERT can handle multiple NLP tasks, this only requires adding a shared parameter layer on top of the generic Transformer framework, overcoming the limitation of traditional language models that can be trained only on the current task and are hard to transfer. At the same time, by jointly extracting and training on the different semantic features of samples from multiple tasks, data utilization is effectively improved.
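The abstract does not spell out the preprocessing algorithm, so the following is only a minimal sketch of how random sampling and keyword extraction might be combined to mask words for unsupervised further pre-training. The jieba tokenizer/keyword extractor and the mask_prob, keyword_boost and top_k parameters are illustrative assumptions, not taken from the thesis.

```python
import random

import jieba
import jieba.analyse  # assumed keyword extractor; the thesis does not name a specific tool

MASK_TOKEN = "<mask>"  # XLM-R's mask token


def mask_sentence(sentence, mask_prob=0.15, keyword_boost=2.0, top_k=5):
    """Mask a fraction of tokens, preferring extracted keywords.

    Combines random sampling with keyword extraction, as described in the
    abstract; the probabilities here are illustrative assumptions.
    """
    tokens = list(jieba.cut(sentence))
    keywords = set(jieba.analyse.extract_tags(sentence, topK=top_k))
    masked = []
    for tok in tokens:
        p = mask_prob * (keyword_boost if tok in keywords else 1.0)
        masked.append(MASK_TOKEN if random.random() < min(p, 1.0) else tok)
    return "".join(masked)


# Example: produce masked text for unsupervised further pre-training of XLM-R.
print(mask_sentence("社交媒体重新定义了文本的创作和传播方式。"))
```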
This thesis studies the multilingual text classification model in depth and carries out experiments on a question classification data set. The pre-training experiments show that the random masking algorithm reaches an F1 score of up to 98% on the Chinese test set, more than 4 percentage points higher than the BERT-Large model, and that the algorithm's performance in the Chinese environment differs little; the preprocessing algorithm preserves the textual characteristics of English and realizes cross-lingual data migration. The comparative experiments show that the prediction accuracy of MT-XLMR is significantly better than that of the other single-task models, indicating that multi-task learning can effectively improve Chinese text classification performance.
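The MT-XLMR architecture is only described at a high level in the abstract; the code below is a minimal PyTorch sketch of a shared XLM-R encoder with task-specific heads for the main classification task and the two auxiliary tasks. The class name, head sizes and label counts are assumptions for illustration, not the thesis implementation.

```python
import torch.nn as nn
from transformers import XLMRobertaModel


class MultiTaskXLMR(nn.Module):
    """Shared XLM-R encoder with one head per task (sketch, not the thesis code)."""

    def __init__(self, num_classes=10, num_entity_tags=9):
        super().__init__()
        self.encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
        hidden = self.encoder.config.hidden_size
        # Task-specific heads share the encoder's parameters (the "shared parameter layer").
        self.cls_head = nn.Linear(hidden, num_classes)      # main task: text classification
        self.ner_head = nn.Linear(hidden, num_entity_tags)  # auxiliary task: named entity recognition
        self.kw_head = nn.Linear(hidden, 2)                  # auxiliary task: keyword / non-keyword tagging

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        token_repr = out.last_hidden_state  # (batch, seq_len, hidden)
        sent_repr = token_repr[:, 0]        # <s> token as the sentence representation
        return {
            "classification": self.cls_head(sent_repr),
            "ner": self.ner_head(token_repr),
            "keyword": self.kw_head(token_repr),
        }


# Training would typically minimize a weighted sum of the three task losses,
# e.g. loss = loss_cls + loss_ner + loss_kw, so that all heads update the shared encoder.
```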
Keywords/Search Tags:Natural Language Processing, BERT, Cross-lingual, Text Classification, Multi-task learning