| With the development of artificial intelligence, speech recognition and prediction have become an important research domain with various applications, such as intelligent control, education, individual identification, and emotion analysis. Chinese poetry reading contains rich features of continuous pronunciation, such as mood, emotion, rhythm schemes, lyric reading, and artistic expression. Therefore, predicting the pronunciation characteristics of Chinese poetry reading is significant for demonstrating high-level machine intelligence, and has the potential to yield a high-level intelligent system for teaching children to read Tang poetry. Mel Frequency Cepstral Coefficients (MFCC) are currently used to represent important speech features. Due to the complexity and high degree of non-linearity in poetry reading, however, accurate pronunciation feature prediction faces a tough challenge: how to model complex spatial correlations and time dynamics such as rhyme schemes. Many current methods ignore the spatial and temporal characteristics of the MFCC representation. In addition, these methods are limited in their long-term prediction performance. To solve these problems, we propose a novel spatial-temporal graph model based on multi-head attention (STGM-MHA) for pronunciation feature prediction of Chinese poetry. The STGM-MHA is designed with an encoder-decoder structure. The encoder compresses the data into a hidden space representation, while the decoder reconstructs the hidden space representation as output. In the model, a novel Gated Recurrent Unit (GRU) module based on multi-head attention (AGRU) is proposed to extract the spatial and temporal features of MFCC data effectively. A comparison of the proposed model with state-of-the-art methods on six datasets reveals the clear advantage of the proposed model. The main contributions of this article are summarized as follows. (1) We propose a new model, STGM-MHA, which effectively extracts features of Chinese poetry by means of graph modeling and, for the first time, analyzes speech as a spatial-temporal graph; (2) Based on the multi-head attention mechanism and graph neural networks (GNN), a novel neural network model is proposed that, for the first time, effectively captures the spatial and temporal dependence of MFCC; (3) A novel module named AGRU is proposed. To reduce the complexity of the model, an autoencoder is introduced to significantly improve training efficiency, and a scheduled sampling mechanism is applied to improve prediction accuracy. Moreover, various experiments are conducted on six speech datasets from six famous poets of the Tang Dynasty: Du Fu, Du Mu, Li Bai, Meng Haoran, Li Shangyin, and Wang Wei. Four common evaluation metrics (MAE, MSE, RMSE, and p-value) are used to verify the performance of the model. According to the experimental results, the proposed model makes better predictions on the public datasets than other state-of-the-art methods; (4) The model can be used not only for MFCC prediction, but also for other spatial-temporal prediction tasks. In the process of reading Tang poetry, the rhythm, style changes, thoughts, and emotions of the poetry subtly influence students, which is conducive to the accumulation of children's knowledge and the broadening of their horizons, and the proposed model can be used to help children read Chinese poetry or for speech synthesis. |
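
Three of the evaluation metrics named above (MAE, MSE, and RMSE) have standard definitions, sketched below in plain Python. This is only an illustration under stated assumptions: the function names and the toy values are hypothetical and do not come from the paper, and the p-value is omitted because the abstract does not say which statistical test is used.

```python
import math


def _check(y_true, y_pred):
    # Both sequences must be non-empty and of equal length.
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")


def mae(y_true, y_pred):
    """Mean absolute error: average of |y - y_hat|."""
    _check(y_true, y_pred)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average of (y - y_hat)^2."""
    _check(y_true, y_pred)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


# Toy values standing in for one MFCC coefficient over four frames
# (made up for illustration, not data from the paper).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.0, 4.5]

print("MAE :", mae(observed, predicted))   # 0.5
print("MSE :", mse(observed, predicted))   # 0.375
print("RMSE:", rmse(observed, predicted))
```

Lower values are better for all three. RMSE is expressed in the same units as the MFCC values, and because it is derived from squared errors it penalizes large individual errors more heavily than MAE does.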