M Tech Dissertations
Permanent URI for this collection: http://ir.daiict.ac.in/handle/123456789/3
Item Open Access
Generative Adversarial Networks for Speech Technology Applications (Dhirubhai Ambani Institute of Information and Communication Technology, 2018)
Shah, Neil; Patil, Hemant A.

The deep learning renaissance has enabled machines to understand observed data in terms of a hierarchy of representations, allowing them to learn complicated nonlinear relationships between representative pairs. In the context of speech, deep learning architectures such as Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) are traditional supervised learning algorithms employing Maximum Likelihood (ML)-based optimization, which minimizes a numerical error between the generated representation and the ground truth. However, a performance gap remains in various speech applications because such numerical error measures may not correlate with the human perception mechanism. Generative Adversarial Networks (GANs), on the other hand, reduce the distributional divergence rather than minimizing numerical errors and hence may synthesize samples with improved perceptual quality. However, the vanilla GAN (v-GAN) architecture may generate a spectrum that belongs to the true desired distribution but does not correspond to the spectral frames given at the input. To address this issue, the Minimum Mean Square Error (MMSE)-regularized MMSE-GAN and CNN-GAN architectures are proposed for the Speech Enhancement (SE) task. The objective evaluation shows an improvement in speech quality and suppression of background interferences over state-of-the-art techniques. The effectiveness of the proposed MMSE-GAN is explored in other speech technology applications, such as Non-Audible Murmur-to-Whisper Speech Conversion (NAM2WHSP), Query-by-Example Spoken Term Detection (QbE-STD), and Voice Conversion (VC). In QbE-STD, a DNN-based GAN with cross-entropy regularization is proposed for extracting an unsupervised posterior feature representation (uGAN-PG), trained on labeled Gaussian Mixture Model (GMM) posteriorgrams. Moreover, the ability of the Wasserstein GAN (WGAN) to improve optimization stability and provide a meaningful loss metric that correlates with the generated sample quality and the generator's convergence is also exploited. To that effect, MMSE-WGAN is proposed for the VC task and its performance is compared with the MMSE-GAN and DNN-based approaches.
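To make the MMSE-regularization idea concrete, the following is a minimal sketch of a generator objective that combines an adversarial term with a mean-squared-error term, in the spirit of MMSE-GAN. The module names (G, D), the least-squares adversarial formulation, and the weight lam are illustrative assumptions, not the architecture or hyperparameters used in the thesis.

```python
import torch
import torch.nn.functional as F

def generator_loss(G, D, noisy_spec, clean_spec, lam=100.0):
    """Adversarial generator loss with an MMSE (L2) regularizer.

    The adversarial term pushes the enhanced spectrum towards the
    distribution of clean speech, while the MMSE term keeps it
    numerically close to the ground-truth frames, so the output both
    looks "real" and corresponds to the given input frames.
    """
    enhanced = G(noisy_spec)                 # generated (enhanced) spectrum
    # Least-squares adversarial term (an illustrative choice): fool the
    # conditional discriminator into labelling `enhanced` as real.
    adv = torch.mean((D(enhanced, noisy_spec) - 1.0) ** 2)
    # MMSE regularizer: mean squared error against the clean target.
    mmse = F.mse_loss(enhanced, clean_spec)
    return adv + lam * mmse
```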
Item Open Access
Unsupervised speaker-invariant feature representations for QbE-STD (Dhirubhai Ambani Institute of Information and Communication Technology, 2018)
R., Sreeraj; Patil, Hemant A.

Query-by-Example Spoken Term Detection (QbE-STD) is the task of retrieving, from a huge collection of audio data, the audio documents relevant to a user query given in spoken form. The idea in QbE-STD is to match the audio documents with the user query directly at the acoustic level. Hence, macro-level speech information, such as language, context, and vocabulary, has little impact, which gives QbE-STD an advantage over Automatic Speech Recognition (ASR) systems. ASR systems face major challenges with audio databases that contain multilingual audio documents, out-of-vocabulary words, and little transcribed or labeled audio data. QbE-STD systems have three main subsystems: feature extraction, feature representation, and matching. This thesis focuses on improving the feature representation subsystem of QbE-STD.

The speech signal needs to be transformed into a speaker-invariant representation in order to be used in speech recognition tasks such as QbE-STD. Speech-related information in an audio signal is primarily hidden in the sequence of phones present in the audio. Hence, to make the features more related to speech, we have to analyze the phonetic information in the speech. In this context, we propose two representations in this thesis, namely, Sorted Gaussian Mixture Model (SGMM) posteriorgrams and Synthetic Minority Oversampling TEchnique-based (SMOTEd) GMM posteriorgrams. The Sorted GMM tries to represent phonetic information using a set of components in the GMM, while the SMOTEd GMM tries to improve the balance of the various phone classes by providing a uniform number of features for all phones. Another approach to improving the speaker invariance of the audio signal is to reduce the variations caused by speaker-related factors in speech. We have focused on the spectral variations that exist between speakers due to differences in vocal tract length as one such factor. To reduce the impact of this variation on the feature representation, we propose to use two models, one per gender, characterized by different spectral scaling, based on the Vocal Tract Length Normalization (VTLN) approach. Recent QbE-STD technologies use neural networks and faster computational algorithms, with neural networks mainly used in the feature representation subsystem. Hence, we also built a simple Deep Neural Network (DNN) framework for the QbE-STD task; the DNN thus designed is referred to as an unsupervised DNN (uDNN). This thesis is a study of different approaches that could improve the performance of QbE-STD. We have built the state-of-the-art model and analyzed the performance of the QbE-STD system. Based on this analysis, we proposed algorithms that can improve the performance of the system, and we further studied the limitations and drawbacks of the proposed algorithms. Finally, the thesis concludes by presenting some potential research directions.
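Since both proposed representations build on GMM posteriorgrams, the sketch below shows how a frame-level posteriorgram can be extracted with an off-the-shelf GMM. The component count and covariance type are illustrative assumptions, and the sorting and oversampling refinements of SGMM and SMOTEd GMM are not shown.

```python
from sklearn.mixture import GaussianMixture

def gmm_posteriorgram(train_frames, query_frames, n_components=64):
    """Map each feature frame (e.g., MFCC) to a vector of posterior
    probabilities over GMM components, fitted without any labels.

    Sequences of such vectors form a posteriorgram, which QbE-STD
    systems typically compare in the matching subsystem (e.g., with
    subsequence DTW).
    """
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(train_frames)                    # unlabeled training frames
    return gmm.predict_proba(query_frames)   # shape: (n_frames, n_components)
```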
Item Open Access
Vowel landmark detection for speech recognition (Dhirubhai Ambani Institute of Information and Communication Technology, 2014)
Undhad, Ankur G.; Patil, Hemant A.

Landmarks are the time instants in a speech utterance which mark important events such as vowels, glides and consonants. This thesis proposes a novel Vowel Landmark Detection (VLD) algorithm to locate vowel landmarks and hence the nucleus of a vowel. VLD finds potential applications in Automatic Speech Recognition (ASR) and Automatic Phonetic Segmentation (APS). The proposed VLD method uses speech source information to detect the vowel landmarks, which are points of high sonority. The excitation peaks in the Hilbert envelope (HE) of the Teager energy profile of the zero-frequency filtered (ZFF) speech signal can be interpreted as perceptually significant features which contribute to loudness. The performance of the proposed VLD method is compared with an existing loudness-based method, with results reported on the TIMIT and NTIMIT corpora. The proposed VLD algorithm has a detection rate of 85.48 % (83.97 %), which is 5.06 % (7.51 %) higher than the existing loudness-based method for the TIMIT (NTIMIT) corpus, respectively. In addition, this thesis proposes the use of the VLD algorithm for two low-resource Indian languages, viz., Gujarati and Marathi.

The results are reported on speech recorded in three different modes, viz., read, spontaneous and lecture, followed by manual phonetic transcription by transcribers (to be used as ground truth) for both Gujarati and Marathi. The proposed VLD algorithm has detection rates of 78.92 %, 76.40 % and 73.89 %, which are 8.79 %, 7.23 % and 7.17 % higher than the loudness-based method in lecture, spontaneous and read modes, respectively, for Gujarati. Similarly, for Marathi, the proposed VLD algorithm has detection rates of 76.93 %, 75.16 % and 73.93 %, which are 7.52 %, 7.43 % and 7.82 % higher than the loudness-based method in lecture, spontaneous and read modes, respectively. The proposed algorithm is also shown to be robust against signal degradation such as white noise. The second part of the thesis recognizes the detected vowel landmarks: a formant-based technique is used to recognize the detected vowels, with results reported on the phonetically transcribed TIMIT corpus. The recognition rate is 32.16 % on the correctly detected vowels (i.e., out of 78,374 vowels, 66,994 are detected correctly, and of those, 21,545 are recognized). The proposed method is very fast and requires no training.
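As a rough illustration of the source-based detection chain described above, the sketch below computes the Teager energy profile of an already zero-frequency filtered signal, takes its Hilbert envelope, and picks prominent peaks as candidate vowel landmarks. The ZFF stage itself is not shown, and the peak-picking thresholds are illustrative assumptions rather than the thesis's tuned settings.

```python
import numpy as np
from scipy.signal import hilbert, find_peaks

def teager_energy(x):
    """Discrete Teager Energy Operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def vowel_landmarks(zff_signal, fs, min_gap_ms=50.0, rel_height=0.1):
    """Pick candidate vowel landmarks (points of high sonority) as
    prominent peaks in the Hilbert envelope of the Teager energy
    profile of a zero-frequency filtered (ZFF) speech signal.
    """
    te = teager_energy(zff_signal)
    env = np.abs(hilbert(te))                  # Hilbert envelope of the TE profile
    min_gap = int(fs * min_gap_ms / 1000.0)    # minimum spacing between landmarks
    peaks, _ = find_peaks(env, distance=min_gap, height=rel_height * env.max())
    return peaks / fs                          # landmark times in seconds
```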
Item Open Access
Person recognition from their hum (Dhirubhai Ambani Institute of Information and Communication Technology, 2011)
Madhavi, Maulik C.; Patil, Hemant A.

In this thesis, the design of a person recognition system based on a person's hum is presented. As hum is a nasalized sound and the Linear Prediction (LP) model does not characterize nasal sounds sufficiently, our approach is based on Mel filterbank-based cepstral features for the person recognition task. The first task consisted of data collection and corpus design for humming; for this purpose, humming of old Hindi songs from around 170 subjects was used. Feature extraction schemes were then developed. The Mel filterbank follows human auditory perception, so MFCC was used as the state-of-the-art feature set. Some modifications in the filterbank structure were then made in order to compute the Gaussian Mel scale-based MFCC (GMFCC) and Inverse Mel scale-based MFCC (IMFCC) feature sets. Two feature sets are mainly explored in this thesis: the first captures phase information via MFCC utilizing the Variable-length Teager Energy Operator (VTEO) in the time domain, i.e., MFCC-VTMP, and the second captures vocal-source information and is called VTEO-based MFCC, i.e., VTMFCC. The proposed MFCC-VTMP feature set has two characteristics: it captures phase information, and it exploits the properties of VTEO. VTEO is an extension of the Teager Energy Operator (TEO), a nonlinear energy-tracking operator. Feature sets like VTMFCC capture vocal-source information, which reflects the excitation mechanism in the speech (hum) production process and is found to be complementary to the vocal tract information. Hence, a score-level fusion of the different source and system features improves person recognition performance.

Item Open Access
Speech driven facial animation system (Dhirubhai Ambani Institute of Information and Communication Technology, 2006)
Singh, Archana; Jotwani, Naresh D.

This thesis is concerned with the problem of synthesizing an animated face driven by a new audio sequence that is not present in the previously recorded database.

The main focus of the thesis is on exploring an efficient mapping of features from the speech domain to the video domain. The mapping algorithms consist of two parts: building a model to fit the training data set and predicting the visual motion from novel audio stimuli. The motivation was to construct a direct mapping mechanism from low-level acoustic signals to visual frames. Unlike previous efforts at higher acoustic levels (phonemes or words), the current approach skips the audio recognition phase, in which it is difficult to obtain high recognition accuracy due to speaker and language variability.

Item Open Access
Gaussian mixture models for spoken language identification (Dhirubhai Ambani Institute of Information and Communication Technology, 2006)
Manwani, Naresh; Mitra, Suman K.; Joshi, Manjunath

Language Identification (LID) is the problem of identifying the language of any spoken utterance irrespective of the topic, speaker or duration of the speech. Although a very large amount of work has been done on automatic language identification, the accuracy and complexity of LID systems remain major challenges. Different methods of speech feature extraction and different baseline learning systems have been used. To understand the role of these issues, a comparative study was conducted over a few algorithms, and its results were used to select an appropriate feature extraction method and baseline system for LID. Based on the results of this study, we have used Gaussian Mixture Models (GMMs), trained using the Expectation Maximization (EM) algorithm, as our baseline system. Mel Frequency Cepstral Coefficients (MFCCs), along with their delta and delta-delta cepstral coefficients, are used as the speech features applied to the system. English and three Indian languages (Hindi, Gujarati and Telugu) are used to test the performance. In this dissertation we have tried to improve the performance of GMMs for LID. Two modified EM algorithms are used to overcome the limitations of the standard EM algorithm: the first is the Split and Merge EM algorithm, and the second is Model Selection-Based Self-Splitting Gaussian Mixture Learning. We have also prepared a speech database for the three Indian languages, namely Hindi, Gujarati and Telugu, which we have used in our experiments.
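The baseline described above reduces to fitting one GMM per language and picking the language whose model scores a test utterance highest. Below is a minimal sketch of that scheme with standard EM-trained mixtures; the component count, covariance type, and function names are illustrative assumptions, and the Split and Merge / Self-Splitting EM variants are not shown.

```python
from sklearn.mixture import GaussianMixture

def train_lid_models(frames_by_language, n_components=32):
    """Fit one EM-trained GMM per language on its MFCC(+delta) frames."""
    models = {}
    for lang, frames in frames_by_language.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        models[lang] = gmm.fit(frames)
    return models

def identify_language(models, utterance_frames):
    """Return the language whose GMM gives the highest average
    per-frame log-likelihood for the utterance."""
    return max(models, key=lambda lang: models[lang].score(utterance_frames))
```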
Item Open Access
Hybrid approach to speech recognition in multi-speaker environment (Dhirubhai Ambani Institute of Information and Communication Technology, 2004)
Trivedi, Jigish S.; Maitra, Anutosh

Recognition of a voice in a multi-speaker environment involves speech separation, speech feature extraction and speech feature matching. Traditionally, Vector Quantization is one of the algorithms used for speaker recognition; however, the effectiveness of this approach is not well appreciated in noisy or multi-speaker environments. This thesis describes a thorough study of the speech separation and speaker recognition process, and a couple of benchmark algorithms have been analysed. The use of Independent Component Analysis (ICA) in the speech separation process has been studied in minute detail. The accuracy of the traditional techniques was tested by simulation in MATLAB. Later, a hybrid approach for speech separation and speaker recognition in a multi-speaker environment is proposed, and test results of a series of experiments that attempt to improve speaker recognition accuracy in a multi-speaker environment using this hybrid approach are presented.

Speaker recognition results obtained by this approach are also compared with the results obtained using a more conventional direct approach, and the usefulness of the hybrid approach is established.
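Since the front end of such a pipeline is ICA-based separation, the following is a small sketch of separating two speakers from two microphone mixtures with FastICA, used here as a generic stand-in under an instantaneous-mixing assumption; the thesis's MATLAB implementation and hybrid matching stage are not reproduced, and the names and parameters are illustrative.

```python
from sklearn.decomposition import FastICA

def separate_speakers(mixtures, n_speakers=2):
    """Estimate independent source signals from microphone mixtures.

    `mixtures` has shape (n_mics, n_samples), one microphone per row.
    The separated estimates can then be fed to the feature-extraction
    and matching stages of the recognizer.
    """
    ica = FastICA(n_components=n_speakers)
    sources = ica.fit_transform(mixtures.T)   # (n_samples, n_speakers)
    return sources.T                          # one estimated speaker per row
```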