Theses and Dissertations

Permanent URI for this collection: http://ir.daiict.ac.in/handle/123456789/1

  • Item (Open Access)
    Acoustic-to-articulatory inversion: speech quality assessment and smoothness constraint
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2015) Rajpal, Avni; Patil, Hemant A.
    Effortless human speech requires coordinated movements of various articulators, muscles, etc. This coordination contributes to the naturalness, intelligibility and speaker identity of human speech, qualities that are only partially present in speech produced by most voice conversion (VC) systems. Hence, information related to speech production is lost during voice conversion. To quantify this loss, two quantities were calculated: mutual information (I) and estimation error. In this thesis, the differences in the estimated articulator trajectories are exploited to propose an objective measure, based on articulatory features, for assessing the quality of voice conversion. Moreover, a new smoothness criterion, jerk minimization, is explored to deal with the non-uniqueness of the speech inversion mapping. Speech is the result of coordinated movements of articulators such as the lips, tongue, jaw and velum, so measured articulator trajectories are smooth and slowly varying. However, the trajectories estimated by acoustic-to-articulatory inversion are found to be jagged. Energy minimization is therefore used as a smoothness constraint to improve the performance of acoustic-to-articulatory inversion. Jerk (i.e., the rate of change of acceleration) is also known to quantify smoothness in human motor movements. This motivates the proposal of jerk minimization as the smoothness criterion for frame-based acoustic-to-articulatory inversion.
  • Item (Open Access)
    Spectro-temporal features based automatic speech recognition
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2015) Nagpal, Ankit; Patil, Hemant A.
    Automatic speech recognition (ASR) technology has found applications in almost every field of life. Real-world environments are rarely noise-free, so deploying ASR in them means dealing with various kinds of noise and channel effects. Thus, the robustness of ASR is becoming increasingly important. State-of-the-art Mel Frequency Cepstral Coefficients (MFCC) features capture spectral information and some temporal dynamics of the speech signal. Spectro-temporal features, on the other hand, are more physiologically motivated, as they capture more perceptual information, and they perform better in the presence of noise. This thesis discusses cepstral analysis, the theory of cepstral coefficients (MFCC and Gammatone Frequency Cepstral Coefficients, i.e., GFCC) and the motivation for using spectro-temporal features. Furthermore, the work presents the theory behind Gabor filters and the motivation for incorporating them into the ASR task. The algorithm for extracting spectro-temporal Gabor filterbank (GBFB) features is also presented in detail. Experiments are carried out on the TIMIT database with various additive noises, such as white, babble, Volvo and high-frequency noise, at various SNR levels. They compare the spectro-temporal features GBFBmel+MFCC and the proposed GBFBGamm+GFCC (incorporating mel and Gammatone filters, respectively) with the state-of-the-art MFCC features. The experiments use HTK as the back end and take into account the effectiveness of the acoustic and language models. It is concluded that, with acoustic modeling only, GBFB features (whether incorporating a Gammatone or a mel filterbank) concatenated with cepstral coefficients perform better than the state-of-the-art MFCC features, both in clean conditions and in the presence of various additive noises or signal degradation conditions. This is because GBFB features capture more local joint spectro-temporal information from the speech signal than MFCC features do.
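To illustrate what "local joint spectro-temporal information" can mean in the second abstract, the following is a minimal sketch of a single 2D Gabor filter applied to one spectrogram patch. It is not the thesis's GBFB extraction algorithm: the Gaussian envelope, the filter size, the modulation frequencies, and the names `gabor_filter` and `patch_response` are all assumptions chosen for illustration, and the patches are synthetic rather than taken from TIMIT.

```python
import cmath
import math


def gabor_filter(n_freq, n_time, omega_k, omega_n, sigma_k, sigma_n):
    """Build a complex 2D Gabor filter as a list of rows (frequency x time).

    omega_k / omega_n are the spectral / temporal modulation frequencies in
    radians per channel / per frame; sigma_k / sigma_n set the width of the
    (assumed) Gaussian envelope.
    """
    k0 = (n_freq - 1) / 2.0  # centre of the filter along frequency
    n0 = (n_time - 1) / 2.0  # centre of the filter along time
    kernel = []
    for k in range(n_freq):
        row = []
        for n in range(n_time):
            dk, dn = k - k0, n - n0
            envelope = math.exp(-0.5 * ((dk / sigma_k) ** 2 + (dn / sigma_n) ** 2))
            carrier = cmath.exp(1j * (omega_k * dk + omega_n * dn))
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def patch_response(patch, kernel):
    """Real part of the inner product between a real patch and the filter."""
    total = 0.0
    for patch_row, kernel_row in zip(patch, kernel):
        for value, weight in zip(patch_row, kernel_row):
            total += (value * weight.conjugate()).real
    return total


# Hypothetical sizes: 9 frequency channels by 25 time frames.
N_FREQ, N_TIME = 9, 25
OMEGA_N = math.pi / 4  # purely temporal modulation, period of 8 frames

kernel = gabor_filter(N_FREQ, N_TIME, 0.0, OMEGA_N, 2.0, 4.0)

# Synthetic patch whose temporal modulation matches the filter.
modulated = [[math.cos(OMEGA_N * (n - 12)) for n in range(N_TIME)]
             for _ in range(N_FREQ)]
# Synthetic patch with no modulation at all.
flat = [[1.0] * N_TIME for _ in range(N_FREQ)]

matched_response = patch_response(modulated, kernel)
flat_response = patch_response(flat, kernel)
print(matched_response, flat_response)
```

The filter responds strongly to the patch whose temporal modulation matches its own and only weakly to the unmodulated patch. A filterbank would repeat this over many combinations of spectral and temporal modulation frequencies and over every position in a mel or Gammatone spectrogram.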