
The Research On Identification Of Chinese Varieties In The Greater China Region

Posted on: 2021-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y M Sun
Full Text: PDF
GTID: 2415330620468764
Subject: Intelligent information processing
Abstract/Summary:
Automatic language recognition is the first step in language processing and language understanding. Accurately detecting the language used in a document is a key step in many natural language processing tasks, such as automatic text classification, machine translation, and multilingual data collection. In recent years, as research on automatic language recognition has advanced, many languages can be detected with high recognition rates. However, because resources for language variants are relatively scarce and any two variants of a language are linguistically close, automatic recognition of language variants remains a challenging task.

Owing to the influence of region, history, culture, social environment, and other factors, the Chinese used in different parts of the Greater China region differs in vocabulary, grammar, and pragmatics; these forms are variants of modern Chinese in the broad sense. Departing from the traditional linguists' viewpoint, this thesis studies the recognition of Chinese variants in the Greater China region from the perspective of computational linguistics and natural language processing, and analyzes the differences among these variants. The main research contents are twofold:

(1) Construction of a Chinese variant recognition model for the Greater China region by integrating classic text classification models

This thesis proposes to integrate classic text classification methods, including traditional machine learning approaches and deep learning-based models. Specifically, we adopt a majority voting algorithm to build a new Chinese variant recognition model and apply it to news articles from the Greater China region. Experiments were conducted on the collected, categorized corpus datasets. The experimental results show that the ensemble model combines the strengths of the individual models and obtains better performance than any single model.

(2) Construction of a Chinese variant recognition model for the Greater China region based on the SENet (Squeeze-and-Excitation Networks) attention mechanism

Inspired by single classic text classification models that incorporate an attention mechanism, this thesis constructs a recognition model for Chinese variants in the Greater China region based on the SENet attention mechanism, which is used to capture the differences among the variants. The mechanism dynamically increases the weight of important discriminative words. Meanwhile, the original word vector features are also incorporated in the training process. Compared with the classic text classification methods, the SENet-based model significantly improves recognition performance on Chinese variants in the Greater China region. A detailed visualization analysis of the experimental results also verifies the effectiveness of the attention model.
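
The majority voting step described in part (1) can be illustrated with a minimal sketch. The abstract does not say which classifiers are combined, how ties are broken, or which region labels are used, so the label strings and the tie rule below (the label seen first, in model order, wins a tie) are assumptions for illustration only and not the thesis's implementation.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model predictions into one label per document.

    predictions: list of label lists, one list per model, all of the
    same length (one label per document).
    Ties go to the label seen first in model order (an assumed rule).
    """
    lengths = {len(p) for p in predictions}
    if len(lengths) != 1:
        raise ValueError("every model must label the same documents")

    combined = []
    # zip(*predictions) yields, for each document, the votes of all models.
    for votes in zip(*predictions):
        label, _count = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical outputs of three classifiers on three news articles.
# The region labels are placeholders, not taken from the thesis.
model_a = ["CN", "TW", "HK"]
model_b = ["CN", "HK", "HK"]
model_c = ["TW", "TW", "MO"]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Hard voting of this kind needs only the predicted labels, so it can combine traditional classifiers and neural models without requiring comparable confidence scores.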
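
The squeeze-and-excitation idea behind part (2) can be sketched in plain Python. The abstract does not describe how SENet is wired into the text model, so this sketch assumes each token position is treated as a channel: a token's word vector is squeezed to one scalar by averaging, a small two-layer bottleneck turns those scalars into gates between 0 and 1, and each word vector is scaled by its gate. The function name, the weight shapes, and the random weights are all hypothetical; in a real model the weights would be learned.

```python
import math
import random


def se_reweight(embeddings, w1, w2):
    """Squeeze-and-excitation style gating over token positions.

    embeddings: T x D nested list (T tokens, D-dimensional word vectors)
    w1: R x T reduction weights; w2: T x R expansion weights
    Returns (gates, reweighted), where gates holds one value per token.
    """
    # Squeeze: summarize each token's vector as a single scalar.
    squeezed = [sum(vec) / len(vec) for vec in embeddings]
    # Excitation, step 1: bottleneck layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: expand back to T values, then apply a sigmoid.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every word vector by its token's gate.
    reweighted = [[g * x for x in vec] for g, vec in zip(gates, embeddings)]
    return gates, reweighted


# Toy input: 4 tokens with 3-dimensional vectors, bottleneck size 2.
random.seed(0)
T, D, R = 4, 3, 2
tokens = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
w1 = [[random.uniform(-1, 1) for _ in range(T)] for _ in range(R)]
w2 = [[random.uniform(-1, 1) for _ in range(R)] for _ in range(T)]

gates, scaled = se_reweight(tokens, w1, w2)
print(gates)
```

Tokens with gates closer to 1 contribute more to the downstream classifier, which is one way the weight of discriminative words could be raised dynamically. The abstract also says the original word vectors are incorporated during training; how they are combined with the reweighted ones is not stated, so that step is left out here.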
Keywords/Search Tags:Language Identification, The Greater China Region, Chinese Variants, Ensemble Method, SENet, Attention Mechanism