Automated Proofreading Study Of Dialect Vocabulary

Posted on: 2017-09-27
Degree: Master
Type: Thesis
Country: China
Candidate: W Wei
GTID: 2355330491456908
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
The dialect sound database of Jiangsu Province was established in September 2013 as part of the Sound Database of Chinese Language Resources project, which started in October 2008. Manually recorded data must be proofread to improve its quality, and doing so by hand costs a great deal of time and effort. Automatic proofreading is therefore necessary, and it is a promising problem in natural language processing.

We aim to detect missing and improper annotations of isolated words in the database, using a speech endpoint detection algorithm and automatic speech recognition technology. With a thresholded zero-crossing rate method, we find missing annotations with a precision of 99.85%. Because no single speech model suitable for recognizing all dialects is available, we instead proofread the annotations by closed test on the limited data.

First, in an exploratory experiment on proofreading the tones of the Nanjing dialect using fundamental frequency, we obtain precisions of 90.78% and 93.61% with SGM and GMM, respectively, which shows that the closed test is effective. When proofreading tones at the syllable level with MFCC and HMM, we reach a precision of 98.54% on the Nanjing dialect, and 95.62% and 98.86% on the Suzhou and Xuzhou dialects, respectively. Second, when the data contain more mistakes, the annotation should be proofread repeatedly; the false-error-callback experiment shows this to be feasible. Third, we proofread the initials and finals with a similar method, which also proves effective and reaches a high precision. These exploratory experiments on the three dialects show that closed-test proofreading with MFCC and HMM is suitable, so these are the feature parameter and statistical model chosen for our system. Finally, we develop a system and test it on the data of the other 67 dialects. It proofreads the annotations with an average precision of 97.79%, which shows that the system can be applied in practice.
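
The closed-test idea described in the abstract can be illustrated with a very small sketch: fit one model per annotated class on the data itself, re-classify the same data, and flag items whose annotation disagrees with the model. The sketch below is not the thesis system, which uses MFCC features and HMMs. It assumes a single one-dimensional Gaussian per tone class over a mean F0 value (one plausible reading of the SGM experiment). The function names, the tone labels `T1` and `T3`, and the F0 values are all hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussians(samples):
    """Fit one 1-D Gaussian (mean, variance) per label from (value, label) pairs."""
    grouped = defaultdict(list)
    for value, label in samples:
        grouped[label].append(value)
    models = {}
    for label, values in grouped.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        models[label] = (mean, max(var, 1e-6))  # floor avoids a zero variance
    return models


def log_likelihood(value, model):
    """Log density of a 1-D Gaussian given as (mean, variance)."""
    mean, var = model
    return -0.5 * (math.log(2 * math.pi * var) + (value - mean) ** 2 / var)


def flag_suspect_labels(samples):
    """Closed test: train on all samples, then re-classify the same samples.

    Returns (index, annotated_label, predicted_label) for every disagreement.
    """
    models = fit_gaussians(samples)
    suspects = []
    for index, (value, label) in enumerate(samples):
        best = max(models, key=lambda lab: log_likelihood(value, models[lab]))
        if best != label:
            suspects.append((index, label, best))
    return suspects


# Hypothetical mean F0 values (Hz) with annotated tone labels.
data = [
    (220.0, "T1"), (225.0, "T1"), (218.0, "T1"), (222.0, "T1"),
    (120.0, "T3"), (125.0, "T3"), (118.0, "T3"), (122.0, "T3"),
    (121.0, "T1"),  # annotated T1 but sits with the T3 values
]

for index, annotated, predicted in flag_suspect_labels(data):
    print(f"item {index}: annotated {annotated}, model suggests {predicted}")
```

Because the mislabeled item also takes part in training, it inflates the variance of its annotated class. This is one reason a repeated proofreading pass, as the abstract proposes, can help when the data contain many errors.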
Keywords/Search Tags:dialect, isolated words, annotation, automatic proofreading