| Speech recognition typically refers to the process of converting human voice signals into the corresponding text, and it belongs to the perceptual-intelligence branch of artificial intelligence. In recent years, with the rapid development of artificial intelligence, speech recognition technology has been widely deployed in vehicles, smart homes, and other scenarios. The resulting market demand has made improving recognition accuracy a research hotspot. Previous studies of Chinese speech recognition have mainly relied on end-to-end word modeling. This thesis instead investigates pinyin modeling: the Chinese syllable is first used as the intermediate result for the speech input, and the syllable sequence is then converted into the corresponding text. On the basis of syllable modeling, this thesis makes the following three contributions. (1) Combining the connectionist temporal classification (CTC) and attention algorithms, a CTC-Attention model is built as the baseline. On top of this baseline, the CTC spike distribution problem and the LayerNorm parameter oscillation problem are addressed, yielding the CTC-Attention-TESB model. Compared with the baseline model, the CTC-Attention-TESB model reduces the syllable character error rate (CER) by 1.08%. After language-model decoding, the CER decreases by 6.04% relative to the baseline model trained with word modeling. (2) Based on the CTC-Attention-TESB model, this thesis designs a multi-task learning algorithm with syllable modeling as the main task and text modeling as an auxiliary task. Experiments show that the syllable-based multi-task model reduces the CER by 1.25% compared with the single-task model and outperforms other mainstream algorithms in low-resource scenarios. (3) To address the problem of mixed languages in Chinese speech recognition, this thesis studies the selection of monolingual data and the optimization of dictionaries, and conducts experiments that verify the effectiveness of data selection. |
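
The results above are reported as syllable character error rate (CER). As a minimal sketch of how such a metric is commonly computed, assuming the standard definition (Levenshtein edit distance between the reference and hypothesis sequences, divided by the reference length), the Python below scores a sequence of pinyin syllables. The function names `edit_distance` and `cer` and the tone-numbered syllables in the usage example are illustrative assumptions, not taken from the thesis, and the thesis's own scoring tool may differ in details such as text normalization.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Error rate = edit distance / number of reference tokens."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substituted tone and one dropped syllable.
reference = ["ni3", "hao3", "shi4", "jie4"]
hypothesis = ["ni3", "hao4", "shi4"]
print(cer(reference, hypothesis))
```

Treating each syllable as one token keeps the metric comparable between syllable-level and text-level outputs, since the same function can score a list of Chinese characters.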