Corpus-based Research On Automatic Recognition Of Hakka And Gan Dialects In Jiangxi Province

Posted on: 2024-04-05, Degree: Doctor, Type: Dissertation
Country: China, Candidate: W Z Yan
GTID: 1525307112474484, Subject: Modern language theory and language application
Abstract/Summary:
With the rapid development of computer and internet technology in recent years, a growing number of social science disciplines have begun to study social patterns and solve social problems with the tools of natural science and information science. As an essential tool for language intelligence, the computer has opened new research directions and application contexts for linguistics, and linguistic research has accordingly become more digital and intelligent. Existing Hakka and Gan dialect corpora in Jiangxi Province face numerous challenges, including scarce resources, limited scale, and a lack of systematic coverage, along with deficiencies in the corresponding dialect speech recognition, dialect branch recognition, and speech feature extraction for computational models of automatic dialect partition. Guided by linguistic theories and methods, drawing on field investigation of local dialects, and with speech recognition as its goal, this thesis attempts to solve these problems. With corpus authenticity, balance, and representativeness taken into consideration, phonetic corpora at three levels of granularity (word, sentence, and text) were collected through field investigation in the Gan and Hakka dialect areas of Jiangxi Province. Comparatively systematic Gan and Hakka dialect speech corpora were thereby established, and three sets of experiments were conducted: end-to-end dialect speech recognition based on neural networks, dialect branch recognition based on transfer learning, and automatic dialect partition based on feature extraction. The research in this thesis focuses on the following four aspects.

(1) Collection and annotation of Hakka and Gan dialects
As a collection of authentic language materials, a corpus is an important basic resource for linguistic research and natural language processing. First, building on the earlier design of the dialect corpus, the author visited most areas of Jiangxi despite the COVID-19 pandemic. Over nearly two years of field work, interviews, recording, and on-the-spot reporting, the author collected 320.69 hours of dialect speech from 2279 villagers at 62 collection points in the Gan dialect areas and 103.48 hours of dialect speech from 666 villagers at 18 collection points in the Hakka dialect areas. The Jiangxi Hakka and Gan phonetic corpora were thus established. Second, the quality of the collected corpus was tested, the parallel corpus was annotated, and rules for corpus annotation and translation were set up. Finally, part of the annotated corpus from the same speaker was randomly selected and manually cross-verified to ensure the consistency of the annotation.

(2) Research on the end-to-end dialect recognition model based on the neural network
This thesis first reproduces the end-to-end dialect speech recognition experiment based on self-attention. Exploiting the strengths of residual CNN and Bi-LSTM in within-frame and between-frame feature extraction, a multi-head self-attention mechanism is adopted to extract the speech features of different dialects, and these features are used for dialect speech recognition. Second, the cross-lingual pre-trained model wav2vec 2.0 is used to learn from the Hakka and Gan dialect corpora. The encoded raw audio is input as latent pronunciation features, so that the model learns speech representations and acquires the ability to recognize different dialects. The experimental results show that: 1) the wav2vec 2.0 model obtains good results as measured by phoneme error rate, with error rates at the word, sentence, and text levels of 0.602, 0.613, and 0.618 respectively; 2) in terms of dialect pronunciation recognition with the wav2vec 2.0 model, Nanchang has the lowest phoneme error rate within the Gan dialect area, and the Hakka dialect area performs best across the whole corpus.

(3) Research on dialect branch recognition based on transfer learning
This thesis adopts transfer learning to recognize dialect branches. First, models are pre-trained on iFlytek corpora; a convolutional neural network (CNN) model, the VGG16 model, the ResNet50 model, and the Xception model are used respectively. Second, corpora at the word, sentence, and text levels are selected from the Hakka and Gan dialect corpora, and secondary training is conducted on the different pre-trained models. Finally, dialect pronunciation features are extracted from the model and input into a classifier to predict the probability distribution over dialects. The experimental results show that: 1) dialect branch recognition based on the Xception model performs best among the models; 2) across the three levels of granularity, recognition at the word level outperforms recognition at the sentence and text levels; 3) Nanchang and Fuzhou in the Gan dialect areas and Ganzhou in the Hakka dialect areas are recognized best.

(4) Research on automatic partition based on dialect speech features
Taking the Gan dialect as an example and building on traditional MFCC speech feature extraction, this thesis proposes a deep learning feature extraction model based on autoencoder dimensionality reduction of speech spectrograms, and puts forward an automatic dialect partition model. K-means clustering, Gaussian mixture clustering, hierarchical clustering, and spectral clustering are examined respectively, and internal clustering indices are used to evaluate the quality of the automatic partition. The experimental results show that: 1) according to the evaluation indices DBI and DI, features from the autoencoder-reduced spectrogram cluster better than the other features; 2) among the clustering methods, K-means clustering, Gaussian mixture clustering, and hierarchical clustering work well; 3) some of the clustering results for the Gan-speaking areas (Nanchang, Jiujiang, and Fuzhou) are comparatively close to the manual dialect partition, which gives them reference value for manual dialect partition in the future.

This research is interdisciplinary and makes several innovations in research content and methods. The cross-lingual model wav2vec 2.0-XLSR is applied to the dialect speech recognition system, which shows that the phoneme recognition task can effectively improve the performance of dialect speech recognition and confirms that the model can effectively characterize dialects. The transfer learning method is applied to dialect branch recognition, which verifies the regional classification performance of different deep transfer learning models on dialects and effectively improves dialect branch recognition. Automatic partitioning of dialects using the numerical features of the speech itself extends the extraction of the distinctive features of dialects and provides an objective, scientific reference for controversial manual dialect partitions. From the perspective of social science, the establishment of the dialect phonetic corpora can fill a gap in the study of the Hakka and Gan dialects and provide rich corpus materials for research on Hakka and Gan dialect recognition. From the perspective of natural science, research on speech recognition for the Hakka and Gan dialects can advance intelligent speech technology and its informatization in Jiangxi, so as to provide intelligent voice services for users of different dialects.
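
The phoneme error rate reported in part (2) is conventionally defined as the edit distance between the recognized and reference phoneme sequences, divided by the total reference length. The abstract does not state the exact scoring procedure used in the thesis, so the following is only a minimal sketch under that conventional definition. The function names and the example phoneme sequences are hypothetical and are not taken from the corpus.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences (lists of strings)."""
    # prev[j] holds the distance between the first i-1 reference phonemes
    # and the first j hypothesis phonemes.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(references, hypotheses):
    """Total edit distance divided by total number of reference phonemes."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_ref = sum(len(r) for r in references)
    if total_ref == 0:
        raise ValueError("references contain no phonemes")
    return total_errors / total_ref


# Hypothetical toy data: one substitution and one deletion over 7 reference phonemes.
refs = [["k", "a", "t"], ["s", "i", "t", "a"]]
hyps = [["k", "a", "p"], ["s", "t", "a"]]
per = phoneme_error_rate(refs, hyps)
print(f"PER = {per:.3f}")
```

Under this definition, a value such as 0.602 would mean roughly six edit operations per ten reference phonemes, pooled over all utterances at that granularity.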
Keywords/Search Tags: corpus, Hakka dialect, Gan dialect, dialect recognition, automatic partition