In recent years, with the in-depth development of Mongolian intelligent information processing technology, Mongolian speech recognition and Mongolian-Chinese machine translation have become increasingly mature. The traditional approach to building a Mongolian-Chinese speech translation system first uses Mongolian speech recognition to transcribe the source speech into text, and then uses Mongolian-Chinese machine translation to translate the source-language text into the target language. However, this cascaded method suffers from error accumulation, time delay, and parameter redundancy. End-to-end speech translation instead translates source-language speech directly into target-language text: a single model performs both the speech recognition and machine translation tasks, and all parameters are optimized toward the final objective, which alleviates the problems of the traditional method. End-to-end speech translation nevertheless faces difficulties of its own, the most important of which is the scarcity of data resources, which makes model training hard. Under such data sparsity, it is therefore of great research value to improve Mongolian-Chinese speech translation performance by exploring suitable end-to-end model structures and training methods. The main contributions of this dissertation are as follows:

1. An end-to-end Mongolian-Chinese speech translation model based on causal convolution is proposed. Building on an attention-based encoder-decoder framework, this dissertation constructs an end-to-end Mongolian-Chinese speech translation baseline model. It then combines the strengths of the Transformer in sequence modeling with the ability of causal convolution to capture temporal position information, using causal convolution to provide positional encoding for the Transformer, which further improves translation performance. Experimental results show that the BLEU score of the causal-convolution model is 1.03 points higher than that of the baseline model.

2. An end-to-end Mongolian-Chinese speech translation model fused with RNN-T is proposed. The model has a multi-module network structure consisting mainly of a speech encoder, a predictor, a text encoder, and a text decoder. It can therefore accommodate more pre-trained model components, and it can also be trained directly on an end-to-end speech translation dataset. Its BLEU score is 0.45 points higher than that of the baseline model, but because of the sparse data its generalization ability remains poor.

3. To address data sparsity, this dissertation exploits the multi-module network structure and proposes a training method that combines a multi-level pre-training strategy with multi-task learning. Compared with direct end-to-end training, which can only use "source-language speech - target-language text" pairs, the proposed method can additionally use "speech-text" data in other languages (such as English), "speech-text" data in the source language, plain source-language text data, and "source-language text - target-language text" parallel data; the effective knowledge in these data is transferred into the model during training. At the same time, multi-task learning sets speech recognition as an intermediate auxiliary objective, adding a constraint on the model. Experimental results show that, with the proposed training method, the BLEU score of the RNN-T-fused end-to-end Mongolian-Chinese speech translation model improves by 3.84 points over the direct training method.
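The combination of staged pre-training and a multi-task objective described above might be sketched as follows. This is a minimal illustration only: the stage-to-module mapping, the function name `multitask_loss`, and the weight `asr_weight` are hypothetical assumptions for exposition, not details taken from the dissertation.

```python
# Hypothetical multi-level pre-training schedule: each stage initializes part
# of the multi-module model from a task with more abundant data, before the
# final fine-tuning on "Mongolian speech -> Chinese text" pairs.
PRETRAIN_STAGES = [
    ("high-resource ASR, e.g. English speech-text", "speech encoder"),
    ("source-language (Mongolian) ASR",             "speech encoder + predictor"),
    ("source-language text data",                   "text encoder"),
    ("Mongolian-Chinese MT, text-text pairs",       "text encoder + text decoder"),
]


def multitask_loss(st_loss: float, asr_loss: float, asr_weight: float = 0.3) -> float:
    """Weighted sum of the main speech-translation loss and the auxiliary
    speech-recognition loss that constrains the shared encoder.

    The interpolation weight is an illustrative choice, not a value
    reported in the dissertation.
    """
    return (1.0 - asr_weight) * st_loss + asr_weight * asr_loss
```

In a setup like this, each pre-training stage transfers knowledge from a higher-resource task into one part of the model, and the auxiliary ASR term in the joint loss keeps the encoder grounded in the source speech during the final end-to-end fine-tuning.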