Automated Proofreading Study Of Dialect Vocabulary

Posted on:2017-09-27

Degree:Master

Type:Thesis

Country:China

Candidate:W Wei

Full Text:PDF

GTID:2355330491456908

Subject:Linguistics and Applied Linguistics

Abstract/Summary:

The dialect sound database of Jiangsu Province was established in September 2013 as part of the Sound Database of Chinese Language Resources project, which started in October 2008. Manually recorded data must be proofread to ensure its quality, but doing so by hand costs a great deal of time and effort. Automatic proofreading is therefore necessary, and it is also a promising problem in natural language processing.

We aim to detect missing and improper annotations of isolated words in the database, using a speech endpoint detection algorithm and automatic speech recognition technology. With a thresholded zero-crossing rate method, we reach a precision of 99.85% in finding missing annotations. Because no speech model suitable for recognizing all dialects is available, we instead proofread the annotations automatically by closed test on the limited data.

First, in an exploratory experiment on proofreading the tones of the Nanjing dialect using fundamental frequency, we obtain precisions of 90.78% and 93.61% with SGM and GMM respectively, which shows that the closed test is effective. When proofreading tones at the syllable level with MFCC and HMM, we reach a precision of 98.54% on the Nanjing dialect, and 95.62% and 98.86% on the Suzhou and Xuzhou dialects respectively. Second, when the data contain more mistakes, the annotation should be proofread repeatedly; the false-error-callback experiment shows that this is feasible. Third, we proofread the initials and finals with a similar method, which also proves effective and reaches a high precision. These exploratory experiments on the three dialects show that closed-test proofreading with MFCC and HMM is suitable for automatic proofreading of the annotation, so these are the feature parameter and statistical model chosen for our system. Finally, we develop a system and test it on the data of the other 67 dialects. It proofreads the annotation with an average precision of 97.79%, which shows that the system can be applied in practice.
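The missing-annotation check described in the abstract can be illustrated with a minimal sketch of a thresholded zero-crossing rate detector. This is not the thesis implementation: the function names, the frame length, the amplitude band, the rate threshold, and the rule that an entry is flagged when its recording contains no sustained run of speech-like frames are all assumptions made for illustration.

```python
import math

def thresholded_zcr(frame, amp_threshold):
    """Rate of crossings between the region above +T and the region below -T.

    Samples inside the band [-T, +T] are ignored, so low-level noise
    does not count as a crossing. (T is a hypothetical tuning value.)
    """
    crossings = 0
    state = 0  # +1 after an excursion above +T, -1 below -T, 0 before any
    for x in frame:
        if x > amp_threshold:
            new_state = 1
        elif x < -amp_threshold:
            new_state = -1
        else:
            continue
        if state != 0 and new_state != state:
            crossings += 1
        state = new_state
    return crossings / len(frame)

def speech_frame_flags(samples, frame_len=160, amp_threshold=0.02,
                       zcr_threshold=0.01):
    """Mark each non-overlapping frame as speech-like (True) or not (False)."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(thresholded_zcr(frame, amp_threshold) >= zcr_threshold)
    return flags

def lacks_speech(samples, min_speech_frames=5, **kwargs):
    """True if no run of at least min_speech_frames speech-like frames exists."""
    run = best = 0
    for flag in speech_frame_flags(samples, **kwargs):
        run = run + 1 if flag else 0
        best = max(best, run)
    return best < min_speech_frames

# Synthetic stand-ins for recordings (assumed 16 kHz sampling rate):
# a 200 Hz tone plays the role of a spoken word, weak alternating
# noise plays the role of an empty recording.
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(8000)]
noise = [0.005 * (1 if n % 2 else -1) for n in range(8000)]

print(lacks_speech(noise))               # empty recording
print(lacks_speech(noise[:4000] + tone)) # silence followed by a word
```

Counting crossings between two amplitude bands rather than around zero is what keeps background noise from being mistaken for speech; in practice the thresholds would have to be tuned on the recordings themselves.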

Keywords/Search Tags:

dialect, isolated words, annotation, automatic proofreading

Related items

1	Preliminary Exploration On Automatic-proofreading Of Chinese Miswriting Characters
2	Research About Wei Jin And The Northern-Southern Dynasties' Annotation Dialect Words
3	Research On Tibetan Automatic Proofreading Technology Based On Mutual Information
4	Study Of Automatic Annotation Of Geographical Names Based On Mongolian Corpus
5	The Study Of Tibetan Ancient Literature Proofreading
6	Features and methods for automatic dialect identification
7	A Study On The Revision Caused By Duan’s Annotation The Headwords Of Shuo Wen Jie Zi
8	A Proofreading Study On The Characters And Words Of Guo Yu (《国语》) And Zuo Zhuan (《左传》)
9	A Corpus-supported Approach To Systemic Functional Grammar:Automatic Annotation And Concordance Of Ideational And Textual Metafunctions
10	Corpus-based Research On Automatic Recognition Of Hakka And Gan Dialects In Jiangxi Province