M Tech Dissertations

Permanent URI for this collection: http://ir.daiict.ac.in/handle/123456789/3

Now showing 1 - 3 of 3
  • ItemOpen Access
    Deep learning techniques for speech pathology applications
    (2020) Purohit, Mirali Virendrabhai; Patil, Hemant A.
    Human-machine interaction has gained attention due to its interesting applications in industry and day-to-day life. In recent years, speech technologies have grown rapidly thanks to advances in machine learning and deep learning. Various deep learning architectures have shown state-of-the-art results in different areas, such as computer vision and the medical domain. Massive success has been achieved in developing speech-based systems, e.g., Intelligent Personal Assistants (IPAs), chatbots, and Text-To-Speech (TTS). However, these systems have certain limitations: speech processing systems work efficiently only on normal-mode speech and hence show poor performance on other kinds of speech, such as impaired speech, far-field speech, and shouted speech. This thesis contributes to the improvement of impaired speech. To address this problem, the work takes two major approaches: 1) classification, and 2) conversion. A new paradigm, namely weak speech supervision, is explored to overcome the data scarcity problem and is proposed for the classification task. In addition, the effectiveness of a residual network-based classifier over a traditional convolutional neural network-based model is shown for the multi-class classification of pathological speech. Furthermore, using Voice Conversion (VC)-based techniques, variants of generative adversarial networks are proposed to repair impaired speech and thereby improve the performance of Voice Assistants (VAs). The performance of these architectures is shown via objective and subjective evaluations. Inspired by the work done with the VC-based technique, this thesis also contributes to the voice conversion field: a state-of-the-art system, namely an adaptive generative adversarial network, is proposed and analyzed by comparing it with a recent state-of-the-art voice conversion method.
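The residual classifier favored in this abstract differs from a plain CNN by adding an identity skip connection around each block. A minimal numpy sketch of that idea (the 8-dimensional toy input and single fully-connected "block" are illustrative assumptions, not the thesis architecture):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W, b):
    """One toy residual block: output = ReLU(W @ x + b) + x.

    The identity shortcut lets the block fall back to passing its
    input through unchanged, which is what makes very deep
    classifiers easier to train than plain stacked layers.
    """
    return relu(W @ x + b) + x

rng = np.random.default_rng(0)
x = rng.standard_normal(8)            # toy 8-dim feature vector
W = rng.standard_normal((8, 8)) * 0.1
b = np.zeros(8)

y = residual_block(x, W, b)
# With W = 0 the block reduces exactly to the identity mapping:
assert np.allclose(residual_block(x, np.zeros((8, 8)), b), x)
```

A plain CNN layer has no such fallback path, which is one common explanation for residual networks outperforming them as depth grows.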
  • ItemOpen Access
    Generative Adversarial Networks for Speech Technology Applications
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2018) Shah, Neil; Patil, Hemant A.
    The deep learning renaissance has enabled machines to understand observed data in terms of a hierarchy of representations, allowing them to learn complicated nonlinear relationships between representative pairs. In the context of speech, deep learning architectures such as Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) are the traditional supervised learning algorithms, employing Maximum Likelihood (ML)-based optimization. These techniques reduce the numerical error between the generated output and the ground truth. However, the performance gap between the generated representation and the ground truth in various speech applications stems from the fact that numerical estimation may not correlate with the human perception mechanism. Generative Adversarial Networks (GANs), on the other hand, reduce the distributional divergence rather than minimizing numerical errors and hence may synthesize samples with improved perceptual quality. However, the vanilla GAN (v-GAN) architecture generates a spectrum that may belong to the true desired distribution yet may not correspond to the spectral frames given at the input. To address this issue, the Minimum Mean Square Error (MMSE)-regularized MMSE-GAN and CNN-GAN architectures are proposed for the Speech Enhancement (SE) task. Objective evaluation shows improvement in speech quality and suppression of background interference over state-of-the-art techniques. The effectiveness of the proposed MMSE-GAN is explored in other speech technology applications, such as Non-Audible Murmur-to-Whisper Speech Conversion (NAM2WHSP), Query-by-Example Spoken Term Detection (QbE-STD), and Voice Conversion (VC). For QbE-STD, a DNN-based GAN with cross-entropy regularization is proposed for extracting an unsupervised posterior feature representation (uGAN-PG), trained on labeled Gaussian Mixture Model (GMM) posteriorgrams.
    Moreover, the ability of the Wasserstein GAN (WGAN) to improve optimization stability and to provide a meaningful loss metric that correlates with generated sample quality and the generator's convergence is also exploited. To that end, MMSE-WGAN is proposed for the VC task, and its performance is compared with the MMSE-GAN and DNN-based approaches.
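The MMSE regularization described in this abstract augments the generator's adversarial objective with a mean-square-error term, so the generated spectrum both fools the discriminator and stays numerically close to the target. A hedged numpy sketch of such a combined loss (the weight `lam` and the log-loss adversarial form are illustrative assumptions, not the exact formulation in the thesis):

```python
import numpy as np

def mmse_gan_generator_loss(d_fake, gen_spec, target_spec, lam=10.0):
    """Toy MMSE-GAN generator objective.

    d_fake      : discriminator scores in (0, 1) for generated spectra
    gen_spec    : generated spectral frames
    target_spec : ground-truth spectral frames
    lam         : weight of the MMSE regularizer (assumed value)
    """
    adversarial = -np.mean(np.log(d_fake + 1e-12))   # push scores toward "real"
    mmse = np.mean((gen_spec - target_spec) ** 2)    # stay close to the target
    return adversarial + lam * mmse

spec = np.ones((4, 16))
# Perfect reconstruction plus a fully fooled discriminator gives ~0 loss:
loss = mmse_gan_generator_loss(np.full(4, 1.0 - 1e-12), spec, spec)
```

The MMSE term is what ties the output to the *given* input frames, addressing the v-GAN failure mode where a sample is realistic but unrelated to the conditioning input.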
  • ItemOpen Access
    Replay spoof detection using handcrafted features
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2018) Tapkir, Prasad Anil; Patil, Hemant A.
    In the past few years, there has been noteworthy demand for Automatic Speaker Verification (ASV) systems in numerous applications. The increased use of ASV systems for voice biometrics comes with the threat of spoofing attacks. ASV systems are vulnerable to five types of spoofing attacks, namely, impersonation, Voice Conversion (VC), Speech Synthesis (SS), twins, and replay. Among these, replay poses a greater threat than any other spoofing attack, as it requires neither specific expertise nor sophisticated equipment: replay attacks take little effort and are the most accessible. The replayed speech can be modeled as a convolution of the genuine speech with the impulse responses of the microphone, multimedia speaker, recording environment, and playback environment. Replay attacks become harder to detect with high-quality intermediate devices and clean recording and playback environments. In this thesis, we propose three novel handcrafted cepstral feature sets for the replay spoof detection task, namely, Magnitude-based Spectral Root Cepstral Coefficients (MSRCC), Phase-based Spectral Root Cepstral Coefficients (PSRCC), and Empirical Mode Decomposition Cepstral Coefficients (EMDCC). In addition, we explore the significance of the Teager Energy Operator (TEO) phase feature for replay spoof detection. The EMDCC feature set replaces the filterbank structure with the Empirical Mode Decomposition (EMD) technique to obtain the subband signals; EMD yields more subbands for replayed speech than for genuine speech. The MSRCC and PSRCC feature sets are extracted from the spectral root cepstrum of the speech signal, which spreads the effect of the additional impulse responses in replayed speech over the entire quefrency domain.
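The convolutional model of replayed speech above can be sketched directly: the replayed signal is the genuine signal convolved, in turn, with each intermediate device and environment impulse response. A minimal numpy sketch (the toy signal and impulse responses are illustrative assumptions):

```python
import numpy as np

def replay(genuine, impulse_responses):
    """Model replayed speech as successive convolutions of the genuine
    signal with loudspeaker, environment, and microphone responses."""
    y = genuine
    for h in impulse_responses:
        y = np.convolve(y, h)   # full linear convolution at each stage
    return y

genuine = np.array([1.0, 0.5, -0.3, 0.2])   # toy "genuine" signal
h_speaker = np.array([1.0, 0.2])            # loudspeaker response (assumed)
h_room = np.array([1.0, 0.0, 0.1])          # playback environment (assumed)
h_mic = np.array([0.9, 0.1])                # recording microphone (assumed)

replayed = replay(genuine, [h_speaker, h_room, h_mic])
# Each convolution lengthens the signal by len(h) - 1 samples:
assert len(replayed) == len(genuine) + 1 + 2 + 1
```

The cleaner each impulse response (closer to a unit impulse), the closer `replayed` is to `genuine`, which is exactly why high-quality devices and clean environments make replay detection harder.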
    The TEO phase feature set provides additional discriminative information when fused with magnitude-based features, such as Mel Frequency Cepstral Coefficients (MFCC). The experiments are performed on the ASVspoof 2017 challenge database, and all systems are implemented using a Gaussian Mixture Model (GMM) classifier. All the proposed feature sets perform better than the ASVspoof 2017 challenge baseline Constant Q Cepstral Coefficients (CQCC) system.
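The Teager Energy Operator underlying the TEO phase feature has a simple discrete form, Ψ[x](n) = x(n)² − x(n−1)·x(n+1). A numpy sketch of just the operator (the framing and phase-extraction steps of the thesis feature set are omitted):

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager Energy Operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone A*cos(w*n), the operator is exactly constant:
# psi[n] = A^2 * sin(w)^2, i.e., it tracks both amplitude and frequency.
n = np.arange(200)
A, w = 2.0, 0.3
psi = teager_energy(A * np.cos(w * n))
assert np.allclose(psi, A**2 * np.sin(w)**2)
```

Because Ψ responds jointly to amplitude and instantaneous frequency, it is sensitive to the extra device and channel responses introduced by replay, which motivates its use alongside magnitude-based features.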