Theses and Dissertations
Permanent URI for this collection: http://ir.daiict.ac.in/handle/123456789/1
Browse
9 results
Search Results
Item Open Access: Unsupervised speaker-invariant feature representations for QbE-STD (Dhirubhai Ambani Institute of Information and Communication Technology, 2018) R., Sreeraj; Patil, Hemant A.
Query-by-Example Spoken Term Detection (QbE-STD) is the task of retrieving audio documents relevant to a user query, given in spoken form, from a large collection of audio data. The idea in QbE-STD is to match the audio documents with the user query directly at the acoustic level. Hence, macro-level speech information, such as language, context, and vocabulary, has little impact. This gives QbE-STD an advantage over Automatic Speech Recognition (ASR) systems, which face major challenges with audio databases containing multilingual audio documents, out-of-vocabulary words, and sparsely transcribed or labeled audio data. QbE-STD systems have three main subsystems: feature extraction, feature representation, and matching. In this thesis, we focus on improving the feature representation subsystem of QbE-STD. The speech signal needs to be transformed into a speaker-invariant representation in order to be used in speech recognition tasks such as QbE-STD. Speech-related information in an audio signal is primarily hidden in the sequence of phones present in the audio. Hence, to make the features more speech-related, we analyze the phonetic information in the speech. In this context, we propose two representations in this thesis, namely, Sorted Gaussian Mixture Model (SGMM) posteriorgrams and Synthetic Minority Oversampling TEchnique-based (SMOTEd) GMM posteriorgrams. The Sorted GMM tries to represent phonetic information using a subset of GMM components, while the SMOTEd GMM tries to improve the balance between phone classes by providing a uniform number of training features for all phones. Another approach to improving the speaker invariance of an audio signal is to reduce the variations caused by speaker-related factors in speech.
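As an illustration of the representation these methods build on, a GMM posteriorgram maps each feature frame to the vector of posterior probabilities over the GMM components. A minimal numpy sketch with hypothetical toy parameters, assuming diagonal-covariance components:

```python
import numpy as np

def gmm_posteriorgram(frames, weights, means, variances):
    """Posterior probability of each diagonal-covariance Gaussian
    component for every feature frame (each row sums to 1)."""
    # log N(x; mu_k, diag(var_k)) for every (frame, component) pair
    log_det = np.sum(np.log(2 * np.pi * variances), axis=1)          # (K,)
    diff = frames[:, None, :] - means[None, :, :]                    # (T, K, D)
    log_lik = -0.5 * (log_det + np.sum(diff**2 / variances, axis=2)) # (T, K)
    log_post = np.log(weights) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)                  # stabilize exp
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

# Toy 2-component GMM in a 2-D feature space (illustrative values only)
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
variances = np.ones((2, 2))
frames = np.array([[0.1, -0.2], [4.9, 5.1]])
P = gmm_posteriorgram(frames, weights, means, variances)
```

In an actual QbE-STD system the GMM is trained unsupervised on unlabeled speech, and the posteriorgram rows (one per frame) are what the matching subsystem compares.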
We have focused on the spectral variations that exist between speakers due to differences in vocal tract length, as one such factor. To reduce the impact of this variation on the feature representation, we propose to use two models, one representing each gender and characterized by a different spectral scaling, based on the Vocal Tract Length Normalization (VTLN) approach. Recent approaches to QbE-STD use neural networks and faster computational algorithms. Neural networks are mainly used in the feature representation subsystem of QbE-STD. Hence, we also built a simple Deep Neural Network (DNN) framework for the QbE-STD task. The DNN thus designed is referred to as an unsupervised DNN (uDNN). This thesis is a study of different approaches that could improve the performance of QbE-STD. We have built the state-of-the-art model and analyzed the performance of the QbE-STD system. Based on this analysis, we proposed algorithms that can improve the performance of the system. We also studied the limitations and drawbacks of the proposed algorithms. Finally, the thesis concludes by presenting some potential research directions.

Item Open Access: Analysis of nonlinearity in speech production mechanism for speaker verification: phase-based approach (Dhirubhai Ambani Institute of Information and Communication Technology, 2015) Agrawal, Purvi; Patil, Hemant A.
Many real-world signal processing problems can be described using linear models and realized as analog or digital, time-invariant filters with finite or infinite impulse response (FIR or IIR). In the recent past, a nonlinear operator called the Teager Energy Operator (TEO) has been introduced and investigated; it has a small window in the temporal domain, making it ideal for local time analysis of signals. This thesis aims to explore the nonlinear nature of the speech production mechanism of a speaker.
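The TEO mentioned above is, in discrete time, commonly defined as Ψ[x(n)] = x²(n) − x(n−1)·x(n+1); it uses only three samples, which is the source of the fine time resolution noted here. A minimal sketch:

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager Energy Operator: psi[n] = x[n]^2 - x[n-1]*x[n+1].
    The output is two samples shorter than the input."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone A*cos(w*n), the discrete TEO equals the constant
# A^2 * sin(w)^2, so it tracks amplitude and frequency in one value.
n = np.arange(200)
tone = 2.0 * np.cos(0.3 * n)
psi = teager_energy(tone)
```

This single-tone (monocomponent) property is exactly why the thesis applies the TEO to subband signals as well as the fullband signal.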
There has been significant advancement in exploring source- and system-based features for speaker recognition, attributed to the characteristics of the excitation source and to the size and shape of the vocal tract. In this work, TEO phase features are derived first from the fullband speech signal and then from subband speech signals (since the TEO is a monocomponent operator). In addition, a feature set is derived from the residual phase extracted from a nonlinear filter designed using the Volterra-Wiener (VW) series, exploiting higher-order linear as well as nonlinear relationships hidden in the sequence of speech samples. Experiments have been performed on the score-level fusion of the proposed feature sets with state-of-the-art MFCC features for the text-independent Speaker Verification (SV) task, based on a Gaussian Mixture Model-Universal Background Model (GMM-UBM) system. The performance of each feature set is evaluated and a comparative study of the features is presented. The results provide an assessment of the nonlinear nature of the speech production mechanism and yield features that improve the performance of the SV system.

Item Open Access: Acoustic-to-articulatory inversion: speech quality assessment and smoothness constraint (Dhirubhai Ambani Institute of Information and Communication Technology, 2015) Rajpal, Avni; Patil, Hemant A.
The ability of humans to speak effortlessly requires coordinated movements of various articulators, muscles, etc. This effortless movement contributes to the naturalness, intelligibility, and speaker identity of human speech, which are only partially present in the speech obtained from most voice conversion (VC) systems. Hence, during voice conversion, information related to speech production is lost. To quantify this loss of information, two quantities, i.e., mutual information (I) and estimation error, were calculated.
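The mutual information just mentioned can be estimated from a joint histogram of the two variables. A rough numpy sketch (histogram estimators are biased, and the thesis's exact estimator is not specified here; this only illustrates the quantity):

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram estimate of I(X;Y) in nats."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of X
    py = pxy.sum(axis=0, keepdims=True)   # marginal of Y
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
a = rng.normal(size=5000)
b = rng.normal(size=5000)  # independent of a, so I(a;b) should be near 0
```

High mutual information between acoustic and articulatory streams indicates that production-related information survived the conversion; values near zero indicate it was lost.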
In this thesis, the differences in the estimated articulator trajectories are exploited to propose an articulatory-feature-based objective measure for assessing the quality of voice conversion. Moreover, a new smoothness criterion, i.e., jerk minimization, is explored to deal with the non-uniqueness of the speech inversion mapping. Speech is the result of coordinated movements of articulators such as the lips, tongue, jaw, velum, etc. Therefore, the measured trajectories are smooth and slowly varying. However, the trajectories estimated by acoustic-to-articulatory inversion are found to be jagged. Thus, energy minimization is used as a smoothness constraint for improving the performance of acoustic-to-articulatory inversion. Moreover, jerk (i.e., the rate of change of acceleration) is known to quantify smoothness in human motor movements. This motivates us to propose jerk minimization as the smoothness criterion for frame-based acoustic-to-articulatory inversion.

Item Open Access: Spectro-temporal features based automatic speech recognition (Dhirubhai Ambani Institute of Information and Communication Technology, 2015) Nagpal, Ankit; Patil, Hemant A.
ASR technology has found application in almost every field of life. Today's world cannot be considered noise-free, and deploying ASR technology in such environments brings the challenge of dealing with various kinds of noise and channel effects. Thus, the robustness of ASR is becoming increasingly important. State-of-the-art Mel Frequency Cepstral Coefficient (MFCC) features capture spectral information and some temporal dynamics of the speech signal. Spectro-temporal features, on the other hand, are more physiologically motivated, capture more perceptual information, and perform better in the presence of noise.
In this thesis, cepstral analysis, the theory of cepstral coefficients (MFCC and Gammatone Frequency Cepstral Coefficients, i.e., GFCC), and the motivation for using spectro-temporal features are discussed. Furthermore, the work presents the theory behind Gabor filters and the motivation to incorporate them for the ASR task. An algorithm for the extraction of spectro-temporal features, namely Spectro-Temporal Gabor FilterBank (GBFB) features, is also presented in detail. Experiments are carried out on the TIMIT database with various additive noises, such as white, babble, volvo, and high-frequency noise (at various SNR levels), to compare the spectro-temporal features, denoted GBFBmel+MFCC and the proposed GBFBGamm+GFCC (incorporating mel and Gammatone filters, respectively), against the state-of-the-art MFCC features. Experiments are carried out with HTK as the back end, taking into account the effectiveness of the acoustic and language models. It is concluded that, with acoustic modeling only, spectro-temporal Gabor filterbank (GBFB) features (incorporating either the Gammatone or the mel filterbank), when concatenated with cepstral coefficients, perform better than the state-of-the-art MFCC features in clean conditions as well as under various additive noises or signal degradation conditions. This is because GBFB features capture more local joint spectro-temporal information from the speech signal than MFCC features do.

Item Open Access: Vocal tract length normalization for automatic speech recognition (Dhirubhai Ambani Institute of Information and Communication Technology, 2014) Sharma, Shubham; Patil, Hemant A.
Various factors affect the performance of Automatic Speech Recognition (ASR) systems. In this thesis, speaker differences due to variations in vocal tract length (VTL) are taken into account. Vocal Tract Length Normalization (VTLN) has become an integral part of modern ASR systems. Different methods have been studied to compensate for these differences in the spectral domain.
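A common way to compensate for VTL differences in the spectral domain is a piecewise-linear frequency warp controlled by a per-speaker warping factor α, applied when the filterbank is built. A sketch of one such warp (the break point and this particular form are illustrative assumptions, not necessarily the exact warp used in the thesis):

```python
import numpy as np

def piecewise_linear_warp(f, alpha, f_cut=0.85, f_max=8000.0):
    """Scale frequencies by alpha below the knee at f_cut * f_max,
    then continue linearly so that f_max always maps to f_max."""
    f = np.asarray(f, dtype=float)
    knee = f_cut * f_max
    return np.where(
        f <= knee,
        alpha * f,
        alpha * knee + (f_max - alpha * knee) * (f - knee) / (f_max - knee),
    )

freqs = np.linspace(0.0, 8000.0, 5)
```

With α > 1 the spectrum is stretched (shorter vocal tract) and with α < 1 it is compressed; α = 1 leaves the axis unchanged, which is why a grid search over α per speaker recovers a normalizing factor.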
In this thesis, various state-of-the-art methods have been implemented and discussed in detail. For example, the method of Lee and Rose uses a maximum likelihood-based approach: it performs a grid search over a range of warping factors to obtain the optimal warping factor for each speaker. On the other hand, the method of Umesh et al. uses the scale transform to obtain VTL-normalized features. Frequency warping is the basis of such normalization techniques. Mel scale warping is the most widely accepted way to compensate for speaker differences, as it is inspired by the hearing process of the human ear. The use of Bark scale-based warping is proposed in this thesis. The Bark scale is based on the human perception of loudness, in contrast with the Mel scale, which is based on pitch perception. Bark scale-based warping provides improved recognition accuracy under mismatched conditions (i.e., training on male (or female) speakers and testing on female (or male) speakers). The performance of the different methods has been tested on ASR tasks in the English, Gujarati, and Marathi languages. The TIMIT database is used for English, and the details of database collection for Gujarati and Marathi are discussed. VTLN has shown improvement over state-of-the-art MFCC features alone for almost all applications considered in this thesis. One of the major tasks in this thesis is the development of Phonetic Engines (PE) using VTLN in three different modes of speech, viz., read, spontaneous, and lecture mode, in Gujarati and Marathi. The Lee-Rose method is used for the design of the PEs, and improved accuracy is achieved using the VTLN-based method compared with MFCCs. In addition, a template matching experiment is performed using the various VTL-normalized features under study and MFCCs for the application of spoken keyword spotting.
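Template matching for keyword spotting is typically done with Dynamic Time Warping (DTW), which aligns two feature sequences of different lengths by minimizing accumulated local distances. A minimal sketch (the local cost and step pattern here are common defaults, assumed rather than taken from the thesis):

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between feature sequences a (T1 x D) and b (T2 x D),
    with Euclidean local cost and steps (1,0), (0,1), (1,1)."""
    t1, t2 = len(a), len(b)
    D = np.full((t1 + 1, t2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[t1, t2]

# A slowed-down rendition of the same template still matches perfectly.
query = np.array([[0.0], [1.0], [2.0]])
doc = np.array([[0.0], [0.0], [1.0], [2.0], [2.0]])
```

The same machinery applies whether the frames are MFCCs, VTL-normalized features, or posteriorgrams; only the local distance changes.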
Better precision and lower Equal Error Rates (EER) are obtained using VTL-normalized Scale Transform Cepstral Coefficients (STCC). This suggests that VTLN-based features can be useful for larger applications such as audio search and Spoken Term Detection (STD).

Item Open Access: Phonetic segmentation: unsupervised approach (Dhirubhai Ambani Institute of Information and Communication Technology, 2013) Vachhani, Bhavikkumar Bhagvanbhai; Patil, Hemant A.
Phonetic segmentation finds potential application in Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) synthesis systems. In this thesis, we propose the use of different spectral features, viz., Mel Frequency Cepstral Coefficients (MFCC), Cochlear Filter Cepstral Coefficients (CFCC), and Perceptual Linear Prediction Cepstral Coefficients (PLPCC), to compute a spectral transition measure (STM) for automatic detection of phonetic boundaries. We propose a new unsupervised algorithm that combines evidence from state-of-the-art MFCC and the proposed CFCC to improve the accuracy of the automatic phonetic boundary detection process. Using the proposed fusion-based approach, we achieve 90% accuracy (i.e., 8% better than MFCC-based STM alone, for a 20 ms tolerance interval) for automatic boundary detection over the entire TIMIT database. Using the proposed PLPCC-based STM approach, we achieve 85% accuracy (i.e., 3% better than state-of-the-art MFCC-based STM for a 20 ms tolerance interval) and a 15% over-segmentation rate (i.e., 8% less than MFCC-based STM) for automatic detection of the 234,925 phone boundaries corresponding to the 630 speakers of the entire TIMIT database.
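A spectral transition measure of the kind used above can be sketched as the mean squared linear-regression slope of each cepstral coefficient over a short window; phone boundaries are then hypothesized at its local peaks. A minimal version under these assumptions (the thesis's exact STM formulation may differ):

```python
import numpy as np

def spectral_transition_measure(C, K=2):
    """STM(t) = mean over feature dimensions of the squared regression
    slope of each coefficient trajectory in a +/-K frame window.
    C is a (T x D) matrix of cepstral features."""
    T, D = C.shape
    k = np.arange(-K, K + 1)
    denom = float(np.sum(k**2))
    stm = np.zeros(T)
    for t in range(K, T - K):
        # least-squares slope of each coefficient over the window
        slope = (k[:, None] * C[t - K:t + K + 1]).sum(axis=0) / denom
        stm[t] = np.mean(slope**2)
    return stm

# Synthetic features: constant, with an abrupt spectral change at frame 10
C = np.vstack([np.zeros((10, 4)), np.ones((10, 4))])
stm = spectral_transition_measure(C)
```

The measure is near zero inside steady phone segments and peaks where the spectrum changes quickly, which is what makes peak picking on it a segmentation rule.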
The second part of the thesis focuses on the development of various applications using the automatically segmented and labeled boundaries.

Item Open Access: Feature based approach for singer identification (Dhirubhai Ambani Institute of Information and Communication Technology, 2012) Radadia, Purushotam G.; Patil, Hemant A.
One of the most challenging problems in Music Information Retrieval (MIR) is identifying the singer of a given song under strong instrumental accompaniment. Besides instrumental sounds, other factors also severely affect Singer IDentification (SID) accuracy, such as the quality of the recording devices, the transmission channels, and other singing voices present in a song. In our work, we study singer identification on a large database of 500 songs (the largest database used for the SID problem to date) prepared from Hindi (Indian language) Bollywood songs. The vocal portions were segmented manually from each song. Several features are employed in addition to the state-of-the-art feature set, Mel Frequency Cepstral Coefficients (MFCC). To identify a singer, three classifiers are employed, viz., a 2nd order polynomial classifier, a 3rd order polynomial classifier, and the state-of-the-art GMM classifier. Furthermore, to alleviate the effect of recording devices and transmission channels, the Cepstral Mean Subtraction (CMS) technique is applied to MFCC for singer identification, and it provides better results than the baseline MFCC alone. Moreover, the 3rd order polynomial classifier performs best among the three classifiers.
Score-level fusion of MFCC and CMS-MFCC is also used in this thesis, and it improves the results significantly.

Item Open Access: Speaker recognition over VoIP network (Dhirubhai Ambani Institute of Information and Communication Technology, 2011) Goswami, Parth A.; Patil, Hemant A.
This thesis deals with Automatic Speaker Recognition (ASR) over narrowband Voice over Internet Protocol (VoIP) networks. A VoIP network has several artifacts, such as the speech codec, packet loss, packet re-ordering, network jitter, and echo. In this thesis, packet loss is considered as the research issue, in order to investigate the performance degradation it causes in an ASR system. As voice packets travel over an Internet Protocol (IP) network, they tend to take different routes: some are dropped by the channel due to congestion, and some are rejected by the receiver. This packet loss reduces the perceptual quality of the speech. Therefore, it is natural to expect that packet loss may affect the performance of an ASR system. To alleviate this degradation, novel interleaving schemes and a lossy training method are proposed. It is shown in the present work that these interleaving schemes and lossy training methods significantly improve the performance of an ASR system.

Item Open Access: Gaussian mixture models for spoken language identification (Dhirubhai Ambani Institute of Information and Communication Technology, 2006) Manwani, Naresh; Mitra, Suman K.; Joshi, Manjunath
Language Identification (LID) is the problem of identifying the language of a spoken utterance irrespective of the topic, the speaker, or the duration of the speech. Although a huge amount of work has been done on automatic language identification, the accuracy and complexity of LID systems remain major challenges. Researchers have used different methods of speech feature extraction and different baseline systems for learning.
To understand the role of these issues, a comparative study was conducted over a few algorithms. The results of this study were used to select an appropriate feature extraction method and baseline system for LID. Based on the results of the study mentioned above, we have used Gaussian Mixture Models (GMM) as our baseline system, trained using the Expectation Maximization (EM) algorithm. Mel Frequency Cepstral Coefficients (MFCC) and their delta and delta-delta cepstral coefficients are used as the speech features fed to the system. English and three Indian languages (Hindi, Gujarati, and Telugu) are used to test performance. In this dissertation, we have tried to improve the performance of GMM for LID. Two modified EM algorithms are used to overcome the limitations of the EM algorithm: the first is the Split and Merge EM algorithm; the second is Model Selection Based Self-Splitting Gaussian Mixture Learning. We have also prepared a speech database for the three Indian languages, namely Hindi, Gujarati, and Telugu, which we have used in our experiments.
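The EM baseline described above can be sketched in one dimension (the thesis works with MFCC vectors; scalars and a simple quantile initialization are used here only to keep the sketch short):

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=50):
    """Fit a 1-D Gaussian mixture to samples x with the EM algorithm."""
    mu = np.quantile(x, np.linspace(0.2, 0.8, k))  # deterministic init
    var = np.full(k, np.var(x))
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        lik = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
            / np.sqrt(2 * np.pi * var)
        r = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4, 1, 400), rng.normal(4, 1, 600)])
w, mu, var = em_gmm_1d(x)
```

In an LID system one such mixture (over MFCC vectors rather than scalars) is trained per language, and a test utterance is assigned to the language whose GMM gives the highest likelihood; the Split-and-Merge and self-splitting variants address EM's sensitivity to initialization and to the choice of the number of components.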