
Research On Automatic Speech Recognition Of Sensitive Words Based On Few-shot

Posted on: 2023-02-21
Degree: Master
Type: Thesis
Country: China
Candidate: J W Zhao
Full Text: PDF
GTID: 2568307061450614
Subject: Cyberspace security
Abstract/Summary:
With the growing popularity of voice communication over the network, speech recognition technology has brought great convenience to more and more fields. Correspondingly, with the rapid development of the Internet and mobile terminals, many illegal activities are also carried out online. Although speech recognition technology for Mandarin is relatively mature, large speech datasets are difficult to obtain for minority languages because of their limited number of speakers. This sample scarcity keeps the recognition rate for minority languages low and makes them an easy hotbed for criminal activity. Meanwhile, criminals are more inclined to use sensitive words carrying undesirable information in their communication, so the timely detection of sensitive words in speech disseminated over the network is an important part of combating illegal and criminal activities.

Traditional feature extraction schemes cannot fully capture the key information in speech, so the scarce speech data available in small-sample scenarios are not fully exploited. Moreover, as a one-dimensional time-domain signal, speech differs from images: the extracted features must be used to train classifiers while handling variation along the time dimension. Consequently, few-shot learning is not as widely studied in the speech domain as in the vision domain, and it often faces scarce labeled datasets, overfitting of network models during training, and low recognition accuracy. Given these problems, this thesis studies feature extraction methods, acoustic models, sensitive-word recognition, and few-shot learning in small-sample scenarios. The main work is as follows:

(1) A high-resolution speech feature extraction scheme, uMFCC, is proposed to make full use of the speech content in small-sample scenarios. First, single-frequency filtering is applied to the pre-processed speech. Then the single-frequency-filtered information is weighted and combined according to the importance of the language content to extract a higher-resolution spectrogram. The extracted acoustic features retain more of the important information indicating whether an utterance contains sensitive words. This scheme reduces the word error rate (WER) by 3.29% and is robust.

(2) A data augmentation scheme is proposed to expand the number of samples, and an acoustic model, Uygformer, is proposed to extract the dependencies between time and frequency. Combined with connectionist temporal classification (CTC), Uygformer predicts the output content from contextual pronunciation even under few-shot conditions. First, to compensate for the shortcomings of small-sample data, three novel data augmentation methods are investigated to increase the number of training samples. Then, an end-to-end encoder-decoder architecture is designed, with two attention mechanisms added at the feature input stage to extract the dependencies between time and frequency. Finally, CTC at the encoder output is used to ensure sequence alignment during training and to exploit all the information in the speech. Experimental results demonstrate that the proposed scheme shows a clear advantage on few-shot speech recognition, with Uyghur as the representative language.

(3) Transfer learning is applied to extend the speech recognition research to other minority languages, with Uyghur and Spanish as examples, further improving recognition accuracy in small-sample scenarios. First, to ensure the effectiveness of transfer, a cross-language similarity detection method based on machine translation is proposed to verify the similarity between the source- and target-domain languages and avoid negative transfer. Subsequently, pre-training and fine-tuning transfer methods are applied for English-to-Uyghur, Chinese-to-Uyghur, and English-to-Spanish transfer, with comparison experiments between fine-tuning only the last layer and fine-tuning the whole decoder. The experiments verify that transfer learning in small-sample scenarios further reduces the WER of Uyghur by 8.66% compared with data augmentation alone, and that the approach can be extended to other languages such as Spanish.
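The abstract mentions three novel data augmentation methods for expanding small speech datasets but does not name them. As a hedged illustration only (these are common waveform-level augmentations, not necessarily the thesis's own methods), the sketch below shows three typical ways to multiply training samples: additive noise at a target SNR, a circular time shift, and amplitude scaling.

```python
import numpy as np

def add_noise(wave: np.ndarray, snr_db: float, rng) -> np.ndarray:
    """Mix white noise into the waveform at a target signal-to-noise ratio (dB)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), wave.shape)
    return wave + noise

def time_shift(wave: np.ndarray, shift: int) -> np.ndarray:
    """Circularly shift the waveform by `shift` samples."""
    return np.roll(wave, shift)

def scale_amplitude(wave: np.ndarray, gain: float) -> np.ndarray:
    """Scale the waveform amplitude by a fixed gain factor."""
    return wave * gain

rng = np.random.default_rng(42)
# 1 second of a 440 Hz tone at 16 kHz stands in for a real utterance.
wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
augmented = [add_noise(wave, 20.0, rng),
             time_shift(wave, 800),
             scale_amplitude(wave, 0.8)]
print(len(augmented), augmented[0].shape)  # three new samples, same length as the original
```

Each augmented copy keeps the original length and label, so a single utterance yields several training samples at no annotation cost.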
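Contribution (2) joins the acoustic model with connectionist temporal classification so that unsegmented speech frames align with the output transcript during training. As a minimal sketch of the CTC idea (not the thesis's implementation), the code below applies the CTC collapse rule to a greedy per-frame decode: take the argmax label in each frame, merge consecutive repeats, then drop the blank symbol. The four-symbol vocabulary and the toy frame probabilities are invented for the example.

```python
import numpy as np

BLANK = 0  # CTC blank index (assumption: index 0, as in many toolkits)

def ctc_greedy_decode(log_probs: np.ndarray) -> list:
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.

    log_probs: (T, V) array of per-frame log-probabilities over V labels.
    """
    best_path = log_probs.argmax(axis=1)  # best label in each frame
    collapsed = []
    prev = None
    for label in best_path:
        if label != prev:                 # merge consecutive repeated labels
            collapsed.append(int(label))
        prev = label
    return [l for l in collapsed if l != BLANK]  # remove blank symbols

# Toy example: 6 frames over the vocabulary {0: blank, 1: 'a', 2: 'b', 3: 'c'};
# the frame-wise best path is "a a - b b -", which collapses to "a b".
frames = np.log(np.eye(4)[[1, 1, 0, 2, 2, 0]] + 1e-9)
print(ctc_greedy_decode(frames))  # → [1, 2]
```

The same collapse rule underlies CTC training: the loss sums the probability of every frame-level path that collapses to the target transcript, which is what lets the model learn alignment without frame-level labels.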
Keywords/Search Tags: Automatic Speech Recognition, Deep Learning, Few-shot Learning, Transfer Learning, Internet Security