
Pairwise Learning Based Acoustic Feature Representation For Low-resource Scenarios

Posted on: 2020-04-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y G Yuan
GTID: 1488306740972929
Subject: Computer Science and Technology

Abstract/Summary:
As transcribed audio data is limited in low-resource scenarios, deep learning cannot obtain a robust acoustic feature representation model. In recent years, pairwise learning, which uses paired examples as weak supervision to learn features, has become a hot research direction for low-resource scenarios. This thesis studies pairwise learning-based acoustic feature representation techniques in low-resource speech scenarios and investigates their effects on ABX phoneme discrimination, isolated word discrimination, query-by-example speech search and keyword spotting. The main contributions are summarized as follows:

1. Pairwise learning-based acoustic feature representation using bottleneck features. As transcribed data is limited in the target language, acoustic representations based on spectrum features have poor discrimination. This thesis borrows transcribed audio data from non-target rich-resource languages to train extractors of cross-lingual and multilingual bottleneck features, and then extracts bottleneck features in the target language as input features for pairwise learning. Experiments on both isolated word discrimination and query-by-example speech search show that bottleneck features have better phoneme discrimination than spectrum features, and that after pairwise learning the resulting acoustic feature representations further improve phoneme and word discrimination.

2. Pairwise learning-based unsupervised acoustic feature representation. Pairwise learning-based acoustic feature representations rely on paired information and discriminative input features. This thesis first uses a Dirichlet process Gaussian mixture model to label untranscribed speech and trains deep neural networks on these labels to obtain unsupervised multilingual bottleneck features; it then applies an unsupervised term discovery algorithm to find word-like speech pairs and uses pairwise learning in a fully untranscribed speech scenario to obtain more efficient frame-level acoustic feature representations. On the ZeroSpeech 2017 challenge, the learned frame-level representations reduce the average error rate in the ABX phoneme discrimination test to 65% of the baseline system's.

3. Pairwise learning-based acoustic word embeddings with temporal context. In query-by-example speech search, acoustic word embeddings are learned from segmented isolated words, which causes a significant mismatch with search content that lacks word boundaries. This thesis includes the leading and trailing frames of target words as temporal context and uses this temporal context padding to learn convolutional and recurrent neural network-based acoustic word embeddings. An analysis window is then shifted over the search content to find the matching spoken query. Compared with frame-level autoencoder features, temporal context-based recurrent neural acoustic word embeddings improve search speed by a factor of 9.35 and mean average precision by a relative 16.5%.

4. Learning attention-based deep binary embeddings for fast query-by-example speech search. Acoustic word embeddings usually have a high dimension and real-valued elements, making distance measurement computationally expensive. This thesis uses a deep hashing network to learn deep binary embeddings and measures Hamming distances to improve search speed. It also introduces an attention mechanism into the deep hashing network and guides training with three specifically designed objectives: a penalization term, a quantization loss and a triplet loss. Compared with recurrent neural acoustic word embeddings, the deep binary embeddings improve search speed by a factor of 8 and mean average precision by a relative 18.9%.

5. Verifying deep keyword spotting detections with acoustic word embeddings. Deep keyword spotting (Deep KWS) systems suffer from performance degradation in real unconstrained scenarios. This thesis proposes a two-stage keyword search method that uses acoustic word embeddings for template matching to verify speech keyword candidates. The acoustic word embeddings are learned with three objective functions: a triplet loss, a reversed triplet loss and a hinge loss. Experiments show that the relative accuracy of pairwise learning-based acoustic word embeddings is 13.6% higher than that of the Deep KWS system.
Keywords/Search Tags:Acoustic feature representation, pairwise learning, unsupervised learning, low-resource scenarios, speech processing