Unsupervised speaker-invariant feature representations for QbE-STD
Abstract
Query-by-Example Spoken Term Detection (QbE-STD) is the task of retrieving, from a large collection of audio data, the audio documents relevant to a user query given in spoken form. The idea in QbE-STD is to match the audio documents with the user query directly at the acoustic level. Hence, macro-level speech information, such as language, context, and vocabulary, has little impact on the matching. This gives QbE-STD an advantage over Automatic Speech Recognition (ASR) systems, which face major challenges on audio databases that contain multilingual audio documents, Out-of-Vocabulary words, little transcribed or labeled audio data, etc. A QbE-STD system has three main subsystems: feature extraction, feature representation, and matching. In this thesis, we focus on improving the feature representation subsystem of QbE-STD. The speech signal needs to be transformed into a speaker-invariant representation in order to be used in speech recognition tasks such as QbE-STD. Speech-related information in an audio signal is carried primarily by the sequence of phones present in the audio. Hence, to make the features more related to speech, we have to analyze the phonetic information in the speech. In this context, we propose two representations in this thesis, namely, Sorted Gaussian Mixture Model (SGMM) posteriorgrams and Synthetic Minority Oversampling TEchnique-based (SMOTEd) GMM posteriorgrams. Sorted GMM tries to represent phonetic information using a set of components in the GMM, while SMOTEd GMM tries to improve the balance among the phone classes by providing a uniform number of features for all the phones. Another approach to improving the speaker invariance of the audio signal is to reduce the variations caused by speaker-related factors in speech. As one such factor, we focus on the spectral variations that exist between speakers due to differences in vocal tract length.
To reduce the impact of this variation on the feature representation, we propose to use two models, one for each gender, characterized by different spectral scaling and based on the Vocal Tract Length Normalization (VTLN) approach. Recent QbE-STD technologies use neural networks and faster computational algorithms. Neural networks are mainly used in the feature representation subsystems of QbE-STD. Hence, we also tried to build a simple Deep Neural Network (DNN) framework for the task of QbE-STD. The DNN thus designed is referred to as the unsupervised DNN (uDNN). This thesis is a study of different approaches that could improve the performance of QbE-STD. We built a state-of-the-art model and analyzed the performance of the QbE-STD system. Based on the analysis, we proposed algorithms that can impact the performance of the system. We also studied the limitations and drawbacks of the proposed algorithms. Finally, this thesis concludes by presenting some potential research directions.
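As a rough illustration of the feature representation and matching subsystems described in the abstract, the following minimal sketch computes plain GMM posteriorgrams and matches a query against a document. It is not the SGMM, SMOTEd GMM, VTLN, or uDNN method proposed in the thesis. The abstract does not name the matching algorithm, so the subsequence dynamic time warping used here is an assumption, as are the diagonal-covariance GMM with given parameters, the negative-log inner-product frame distance, and all function names.

```python
import math

def gmm_posteriorgram(frames, weights, means, variances):
    """Map each feature frame to its vector of GMM component posteriors.

    Assumes a diagonal-covariance GMM whose parameters are already trained.
    """
    posteriorgram = []
    for x in frames:
        log_p = []
        for w, mu, var in zip(weights, means, variances):
            ll = math.log(w)
            for xd, md, vd in zip(x, mu, var):
                ll += -0.5 * (math.log(2 * math.pi * vd) + (xd - md) ** 2 / vd)
            log_p.append(ll)
        peak = max(log_p)  # subtract the peak for numerical stability
        exp_p = [math.exp(v - peak) for v in log_p]
        total = sum(exp_p)
        posteriorgram.append([v / total for v in exp_p])
    return posteriorgram

def frame_distance(p, q, eps=1e-10):
    """Negative log inner product of two posterior vectors (assumed metric)."""
    return -math.log(max(sum(a * b for a, b in zip(p, q)), eps))

def subsequence_dtw(query, doc):
    """Lowest cost of aligning the whole query to any sub-segment of doc.

    The query may start and end at any document frame. The cost is divided
    by the query length, a simplification of path-length normalisation.
    """
    n, m = len(query), len(doc)
    cost = [[float("inf")] * m for _ in range(n)]
    for j in range(m):
        cost[0][j] = frame_distance(query[0], doc[j])  # free start point
    for i in range(1, n):
        for j in range(m):
            best = cost[i - 1][j]
            if j > 0:
                best = min(best, cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = frame_distance(query[i], doc[j]) + best
    return min(cost[n - 1]) / n  # free end point

# Toy 1-D example with a hypothetical two-component GMM.
weights, means, variances = [0.5, 0.5], [[0.0], [5.0]], [[1.0], [1.0]]
query = gmm_posteriorgram([[0.0], [5.0]], weights, means, variances)
doc_hit = gmm_posteriorgram([[5.0], [0.0], [5.0], [5.0]], weights, means, variances)
doc_miss = gmm_posteriorgram([[5.0], [5.0], [5.0]], weights, means, variances)
print(subsequence_dtw(query, doc_hit) < subsequence_dtw(query, doc_miss))
```

In this toy setup, the document that contains the query pattern receives the lower alignment cost. A real system would train the GMM on unlabeled speech features and rank all documents by this cost.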
