Research And Implementation Of Automatic Labeling System For Quasi Written Language Korean Speech Corpus

Posted on:2020-05-16

Degree:Master

Type:Thesis

Country:China

Candidate:M Y Li

Full Text:PDF

GTID:2415330572989366

Subject:Computer technology

Abstract/Summary:

With the advent of the artificial intelligence era, related technologies have developed rapidly. Speech recognition is increasingly integrated into daily life and industry through products such as voice input, voice assistants, spoken-language translation, intelligent customer service, and intelligent hardware. Every breakthrough in recognition methods has depended on high-quality, large-scale speech corpora, so corpus construction is an important foundation for speech recognition technology. Over the past few decades, speech corpora for several minority languages of China, such as Uyghur, Tibetan and Mongolian, have begun to take shape. The construction of a Korean phonetic corpus, however, remains seriously inadequate. In view of this situation, this dissertation draws on the pronunciation features of Korean to propose an automatic labeling method for quasi-written Korean language corpora, and designs an automatic labeling system.

Firstly, an automatic syllable segmentation method for Korean speech is proposed, based on the Seneff auditory model. The method determines positive and negative mutation points from parameters such as ALSD and ED output by the Seneff auditory model. After further analysis of the main causes of segmentation errors, an improved Korean syllable segmentation algorithm is proposed.

Secondly, a speech-text alignment method for quasi-written Korean language corpora is proposed. To reduce the cumulative influence of syllable segmentation errors during alignment, the preprocessing stage applies the proposed speech sentence segmentation algorithm and the speech segment fine segmentation algorithm to divide the corpus into a series of smaller speech segments. Speech-text alignment is then carried out using the proposed Korean continuation rule and syllable authenticity discrimination algorithm.

Finally, an automatic labeling system for quasi-written Korean language corpora is designed and implemented. The system outputs the automatically generated annotation file of the speech corpus as its final result. The automatic syllable segmentation algorithm and the speech-text alignment method are its core technologies, and both are implemented in Python. The design and implementation follow a software engineering approach, covering requirements analysis, overall design, division into functional modules, and testing.

Experimental and test results show that the improved automatic segmentation algorithm reaches an accuracy of 86.76%, and the speech-text alignment algorithm reaches an accuracy of 70.26%. Functional module testing of the system met the design goals, and no defects were found. The annotation method proposed in this dissertation differs from both manual annotation and annotation based on speech recognition: it aligns and labels quasi-written Korean language corpora automatically through syllable segmentation and speech-text alignment. The method is simple, efficient and easy to implement, and has theoretical and practical value for promoting the research and construction of Korean phonetic corpora.
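The abstract does not give the segmentation algorithm itself, so the following is only a minimal sketch of the general idea of positive and negative mutation points. It assumes a single envelope-like feature sequence stands in for the ALSD and ED parameters, treats any frame-to-frame jump of at least a hypothetical `threshold` as a mutation point, and pairs each rise with the next fall to form a candidate syllable span. The function names and the pairing rule are illustrative assumptions, not the thesis's method.

```python
from typing import List, Tuple


def find_mutation_points(feature: List[float],
                         threshold: float) -> Tuple[List[int], List[int]]:
    """Return frame indices of abrupt rises (positive) and falls (negative).

    `feature` is a hypothetical per-frame value standing in for the
    auditory-model outputs; `threshold` is an assumed tuning parameter.
    """
    positive: List[int] = []
    negative: List[int] = []
    for i in range(1, len(feature)):
        delta = feature[i] - feature[i - 1]
        if delta >= threshold:
            positive.append(i)      # abrupt rise: candidate syllable onset
        elif delta <= -threshold:
            negative.append(i)      # abrupt fall: candidate syllable offset
    return positive, negative


def pair_segments(positive: List[int],
                  negative: List[int]) -> List[Tuple[int, int]]:
    """Pair each rise with the next fall to form candidate syllable spans."""
    segments: List[Tuple[int, int]] = []
    j = 0
    for start in positive:
        # Ignore a rise that falls inside the span already emitted.
        if segments and start < segments[-1][1]:
            continue
        # Advance to the first fall strictly after this rise.
        while j < len(negative) and negative[j] <= start:
            j += 1
        if j == len(negative):
            break
        segments.append((start, negative[j]))
    return segments


if __name__ == "__main__":
    # Made-up feature values with two clear bursts of energy.
    toy = [0.0, 0.1, 0.9, 1.0, 0.9, 0.1, 0.0, 0.8, 0.9, 0.1]
    pos, neg = find_mutation_points(toy, threshold=0.5)
    print(pair_segments(pos, neg))
```

A real system would derive the feature sequence from the auditory model and would need the error analysis the abstract mentions to handle merged or split syllables; the fixed threshold here is purely for illustration.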

Keywords/Search Tags:

Korean phonetic corpus, automatic annotation of speech corpus, automatic segmentation of syllables, speech-text alignment

Related items

1	Research In Automatic Contrast Technique Of Vocabulary In Mongolian Text
2	Information Processing On Mencius And Its Commentations And Annotations
3	Applying Web Data Mining To The Parallel Corpus: The Automatic Identification And Alignment Of The Corresponding Units
4	The Semantic Relation Pattern Of "V[Double Syllables]+V[Double Syllables]" & Automatic Recognition
5	The Study Of Automatic Chinese Phoneticize Label Based On Automatic Word Segmentation
6	Research On Textual Similarity Of Ancient Chinese Annotated Corpus Based On Deep Learning
7	A Study On Cantonese Word Segmentation Specification For Information Processing
8	The Study On Chinese Text Segmentation
9	Burmese Text Analysis And Implementation For Speech Synthesis
10	Research On Automatic Evaluation Of Voice Politeness In Service Industry