Generative Adversarial Networks for Speech Technology Applications

Shah, Neil

Files

201611055_Neil Shah.pdf (3.44 MB)

Date

2018

Authors

Shah, Neil

Publisher

Dhirubhai Ambani Institute of Information and Communication Technology

Abstract

The deep learning renaissance has enabled machines to understand observed data in terms of a hierarchy of representations. This allows machines to learn complicated nonlinear relationships between paired representations. In the context of speech, deep learning architectures such as Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) are traditional supervised learning algorithms that employ Maximum Likelihood (ML)-based optimization. These techniques minimize a numerical error between the generated output and the ground truth. However, a performance gap remains between the generated representation and the ground truth in various speech applications, because the numerical error may not correlate with the human perception mechanism. Generative Adversarial Networks (GANs), on the other hand, reduce the distributional divergence rather than minimizing numerical errors, and hence may synthesize samples with improved perceptual quality. However, the vanilla GAN (v-GAN) architecture generates a spectrum that may belong to the true desired distribution but may not correspond to the spectral frames given at the input. To address this issue, the Minimum Mean Square Error (MMSE)-regularized MMSE-GAN and CNN-GAN architectures are proposed for the Speech Enhancement (SE) task. Objective evaluation shows improved speech quality and suppression of background interference over state-of-the-art techniques. The effectiveness of the proposed MMSE-GAN is also explored in other speech technology applications, such as Non-Audible Murmur-to-Whisper Speech Conversion (NAM2WHSP), Query-by-Example Spoken Term Detection (QbE-STD), and Voice Conversion (VC). In QbE-STD, a DNN-based GAN with cross-entropy regularization is proposed for extracting an unsupervised posterior feature representation (uGAN-PG), trained on labeled Gaussian Mixture Model (GMM) posteriorgrams.
Moreover, the ability of the Wasserstein GAN (WGAN) to improve optimization stability and to provide a meaningful loss metric that correlates with generated sample quality and the generator's convergence is also exploited. To that end, MMSE-WGAN is proposed for the VC task, and its performance is compared with the MMSE-GAN and DNN-based approaches.

Keywords

Artificial intelligence, Neural network, Speech recognition, Voice conversion

Citation

Shah, Neil (2018). Generative Adversarial Networks for Speech Technology Applications. Dhirubhai Ambani Institute of Information and Communication Technology, xiv, 86 p. (Acc. No: T00738)

URI

http://ir.daiict.ac.in/handle/123456789/772

Collections

M Tech Dissertations
