Theses and Dissertations
Permanent URI for this collection: http://ir.daiict.ac.in/handle/123456789/1
6 results
Search Results
Item Open Access
Replay spoof detection using handcrafted features
(Dhirubhai Ambani Institute of Information and Communication Technology, 2018) Tapkir, Prasad Anil; Patil, Hemant A.
In the past few years, there has been a noteworthy demand for Automatic Speaker Verification (ASV) systems in numerous applications. The increased use of ASV systems for voice biometrics comes with the threat of spoofing attacks. ASV systems are vulnerable to five types of spoofing attacks, namely, impersonation, Voice Conversion (VC), Speech Synthesis (SS), twins and replay. Among these, replay poses a greater threat to ASV systems than any other spoofing attack, as it requires neither specific expertise nor sophisticated equipment. Replay attacks require little effort and are the most accessible attacks. The replay speech can be modeled as a convolution of the genuine speech with the impulse responses of the microphone, multimedia speaker, recording environment and playback environment. Replay attacks become harder to detect with high-quality intermediate devices and clean recording and playback environments. In this thesis, we propose three novel handcrafted cepstral feature sets for the replay spoof detection task, namely, Magnitude-based Spectral Root Cepstral Coefficients (MSRCC), Phase-based Spectral Root Cepstral Coefficients (PSRCC) and Empirical Mode Decomposition Cepstral Coefficients (EMDCC). In addition, we explore the significance of the Teager Energy Operator (TEO) phase feature for replay spoof detection. The EMDCC feature set replaces the filterbank structure with the Empirical Mode Decomposition (EMD) technique to obtain the subband signals. EMD yields more subbands for the replay speech signal than for the genuine speech signal. The MSRCC and PSRCC feature sets are extracted using the spectral root cepstrum of the speech signal.
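The convolutional replay model described in this abstract can be illustrated with a minimal sketch. It is not taken from the thesis: the function names and the short toy impulse responses below are hypothetical, chosen only to show how a genuine signal passes through a cascade of device and environment responses.

```python
def convolve(x, h):
    """Full linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def simulate_replay(genuine, impulse_responses):
    """Pass the genuine signal through each impulse response in turn."""
    replayed = list(genuine)
    for h in impulse_responses:
        replayed = convolve(replayed, h)
    return replayed


# Toy values (assumptions for illustration, not measured responses).
genuine = [1.0, 0.5, -0.25]
h_microphone = [1.0, 0.1]
h_recording_env = [1.0, 0.0, 0.2]
h_speaker = [0.9, 0.05]
h_playback_env = [1.0, 0.3]

replayed = simulate_replay(
    genuine, [h_microphone, h_recording_env, h_speaker, h_playback_env]
)
print(len(genuine), len(replayed))
```

Each extra stage lengthens and colours the signal. When the impulse responses are close to a unit impulse, as with high-quality devices and clean environments, the replayed signal stays close to the genuine one, which is why detection becomes harder.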
The spectral root cepstrum spreads the effect of the additional impulse responses in replay speech over the entire quefrency domain. The TEO phase feature set contributes security-relevant information when fused with magnitude-based features, such as Mel Frequency Cepstral Coefficients (MFCC). The experiments are performed on the ASVspoof 2017 challenge database, and all the systems are implemented using a Gaussian Mixture Model (GMM) as the classifier. All the feature sets perform better than the ASVspoof 2017 challenge baseline Constant Q Cepstral Coefficients (CQCC) system.

Item Open Access
Vocal tract length normalization for automatic speech recognition
(Dhirubhai Ambani Institute of Information and Communication Technology, 2014) Sharma, Shubham; Patil, Hemant A.
Various factors affect the performance of Automatic Speech Recognition (ASR) systems. In this thesis, speaker differences due to variations in vocal tract length (VTL) are taken into account. Vocal Tract Length Normalization (VTLN) has become an integral part of ASR systems. Different methods have been studied to compensate for these differences in the spectral domain. In this thesis, various state-of-the-art methods are implemented and discussed in detail. For example, the method of Lee and Rose uses a maximum likelihood-based approach: it performs a grid search over a range of warping factors to obtain the optimal warping factor for each speaker. On the other hand, the method of Umesh et al. uses the scale transform to obtain VTL-normalized features. Frequency warping is the basis of such normalization techniques. Mel scale warping is the most widely accepted for compensating speaker differences, as it is inspired by the hearing process of the human ear. The use of Bark scale-based warping is proposed in this thesis. The Bark scale is based on the perception of loudness by the human ear, in contrast with the Mel scale, which is based on pitch perception.
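To make the two warping scales concrete, the sketch below converts frequencies in Hz to Mel and Bark values. The abstract does not say which formulas the thesis uses. The Mel formula here is the widely used 2595 log10(1 + f/700) form, and the Bark formula is the Zwicker and Terhardt approximation. Both choices are assumptions made for illustration.

```python
import math


def hz_to_mel(f_hz):
    """Common Mel-scale approximation: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def hz_to_bark(f_hz):
    """Zwicker-Terhardt Bark approximation (one of several in use)."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))


# Print both warped values for a few frequencies in the speech band.
for f in (100.0, 500.0, 1000.0, 4000.0, 8000.0):
    print(f"{f:7.0f} Hz  mel={hz_to_mel(f):8.1f}  bark={hz_to_bark(f):5.2f}")
```

Both mappings are roughly linear at low frequencies and compressive at high frequencies. They differ in how quickly the compression sets in, which is the property a Bark-based front end would change relative to a Mel-based one.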
Bark scale-based warping provides improved recognition accuracy under mismatched conditions (i.e., training on male (or female) speakers and testing on female (or male) speakers). The performance of the different methods has been tested on different ASR tasks in the English, Gujarati and Marathi languages. The TIMIT database is used for English, and the details of database collection for Gujarati and Marathi are discussed. VTLN improves on state-of-the-art MFCC features alone for almost all applications considered in this thesis. One of the major tasks in this thesis is the development of Phonetic Engines (PEs) using VTLN in three different modes of speech, viz., read, spontaneous and lecture mode, in Gujarati and Marathi. The Lee-Rose method is used for the design of the PEs. Improved accuracy is achieved using the VTLN-based method as compared to MFCCs. In addition, a template matching experiment is performed using the various VTL-normalized features under study and MFCCs for spoken keyword spotting. Better precision and lower equal error rates (EER) are obtained using VTL-normalized Scale Transform Cepstral Coefficients (STCC). This suggests that VTLN-based features can be useful for larger applications such as audio search and spoken term detection (STD).

Item Open Access
Design of syllable-based speech segmentation methods for text-to-speech (TTS) synthesis system for Gujarati
(Dhirubhai Ambani Institute of Information and Communication Technology, 2013) Talesara, Swati; Patil, Hemant A.
Text-to-speech (TTS) synthesizers have proved to be an aid for many visually challenged people, enabling reading through auditory feedback. Although TTS synthesizers are available in English and other languages, it has been observed that people feel more comfortable learning in their native language.
Keeping this point in mind, a Gujarati TTS synthesizer has been built, and the building process is described in this thesis. This TTS system has been built in the Festival speech synthesis framework. The syllable is taken as the basic speech sound unit in building the Gujarati TTS synthesizer, as Indian languages are syllabic in nature. However, building this unit-selection-based Gujarati TTS system requires a large labeled Gujarati corpus. Labeling is the most time-consuming and tedious task and requires considerable manual effort. In this thesis work, an attempt has been made to reduce this manual effort by automatically generating a labeled corpus at the syllable level. To that effect, a Gaussian-based segmentation method has been proposed for automatic segmentation of speech at the syllable level. It has been observed that the percentage correctness of labeled data is around 80 % for both male and female voices, as compared to 70 % for group delay-based labeling. In addition, the system built on the proposed approach shows better intelligibility when evaluated by a visually challenged subject. The word error rate is reduced by 5 % for the Gaussian filter-based TTS system, compared to the group delay-based TTS system. Furthermore, a 5 % increment is observed in correctly synthesized words. The main focus of this thesis has been to reduce the manual effort required in building a TTS system (namely, the manual effort required in labeling speech data) for the Gujarati language.

Item Open Access
Phonetic segmentation: unsupervised approach
(Dhirubhai Ambani Institute of Information and Communication Technology, 2013) Vachhani, Bhavikkumar Bhagvanbhai; Patil, Hemant A.
Phonetic segmentation has potential applications in Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) synthesis systems.
In this thesis, we propose the use of different spectral features, viz., Mel Frequency Cepstral Coefficients (MFCC), Cochlear Filter Cepstral Coefficients (CFCC) and Perceptual Linear Prediction Cepstral Coefficients (PLPCC), to compute the spectral transition measure (STM) for automatic phonetic boundary detection. We propose a new unsupervised algorithm that combines evidence from state-of-the-art MFCC and the proposed CFCC to improve the accuracy of automatic phonetic boundary detection. Using the proposed fusion-based approach, we achieve 90 % accuracy (i.e., 8 % better than MFCC-based STM alone for a 20 ms tolerance interval) for automatic boundary detection on the entire TIMIT database. Using the proposed PLPCC-based STM approach, we achieve 85 % accuracy (i.e., 3 % better than state-of-the-art MFCC-based STM for a 20 ms tolerance interval) and a 15 % over-segmentation rate (i.e., 8 % less than MFCC-based STM) for automatic boundary detection of 2,34,925 phone boundaries corresponding to 630 speakers of the entire TIMIT database. The second part of the thesis focuses on the development of various applications using automatically segmented and labeled boundaries.

Item Open Access
Objective evaluation of speech quality of text-to-speech (TTS) synthesis systems
(Dhirubhai Ambani Institute of Information and Communication Technology, 2013) Sailor, Hardik Bhupendra; Patil, Hemant A.
As the use of Text-to-Speech (TTS) technology increases, there is a high demand for TTS systems that can produce natural and intelligible voices in any environment. To improve a speech synthesis system, synthesized speech must be properly evaluated so that the gap between natural and synthetic speech can be identified and addressed by developing proper methods in each modelling block of the TTS system.
This thesis addresses the machine evaluation approach, known as the objective method, for speech quality measurement of TTS voices. In this thesis work, conventional techniques for evaluating the speech quality of TTS voices as well as recently proposed techniques are used. It is shown that conventional techniques such as PESQ and spectrogram analysis are not able to capture cues related to speech naturalness. Also, experimental results show that distance-based objective measures using perceptual features, viz., Perceptual Cepstral Distance (PCD), are not appropriate for speech quality evaluation of TTS voices. To assess the naturalness of synthetic speech, a recently proposed method based on pitch (i.e., F0) information in the speech signal is used. Since the human speech production model is difficult to apply in speech synthesis systems, pitch or fundamental frequency (F0)-related features are used and their direct correlation with subjective scores is obtained. The results on the Blizzard Challenge speech database show the potential of these features, with a correlation coefficient of 0.59; however, this still needs to be improved. For speech intelligibility, a simple phone recognition method is developed in this thesis work, and experiments on CMU ARCTIC data show a good correlation coefficient of -0.77 with the MCD measure, a common measure of speech quality in TTS. As part of the TTS team at DA-IICT, a TTS system in the Gujarati language is developed so that users can communicate with the machine in their native language. All objective measures discussed in this thesis are applied and compared with subjective scores.
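The MCD measure mentioned above is usually read as Mel Cepstral Distortion. The abstract does not expand the acronym or give a formula, so the sketch below follows the commonly used definition and should be taken as an assumption, not as the thesis's implementation. It further assumes that the reference and synthetic frames are already time-aligned and that coefficient 0 (energy) is excluded.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Average MCD in dB over aligned frames, skipping coefficient 0."""
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need equal, non-zero numbers of aligned frames")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        # Squared Euclidean distance over coefficients 1..D.
        sq_dist = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * sq_dist)
    return total / len(ref_frames)


# Toy mel-cepstral frames (hypothetical values for illustration).
reference = [[0.0, 1.0, 2.0], [0.0, 0.5, 0.5]]
synthetic = [[3.0, 2.0, 2.0], [1.0, 0.5, 0.5]]
print(round(mel_cepstral_distortion(reference, synthetic), 3))
```

A lower value means the synthetic cepstra are closer to the reference. A negative correlation with a recognition-based intelligibility score, as reported in the abstract, is consistent with that direction.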
Based on the experiments, it is evident that objective measures are applicable only to Statistical Parametric Speech Synthesis (SPSS) systems and related technologies, since in unit-selection-based TTS the speech output is a concatenated version of natural speech sound units.

Item Open Access
Speech enhancement using microphone array for hands-free speech applications
(Dhirubhai Ambani Institute of Information and Communication Technology, 2004) Vichare, Chirag Vishwas; Chakka, Vijaykumar
This thesis addresses the problem of multi-microphone speech enhancement using a Generalized Singular Value Decomposition (GSVD)-based optimal filtering algorithm. This algorithm does not require any sensitive geometric information about the array layout and hence is more robust to deviations from the assumed signal model (e.g., look direction error, microphone mismatch, speech detection errors) than conventional multi-microphone noise reduction techniques such as beamforming. However, the high computational complexity of this algorithm makes it unsuitable for practical implementation. The work presented in this thesis discusses a recursive version of this algorithm in which the GSVD of the data and noise matrices at any instant is updated, as new data arrives, using the GSVD of the matrices available at the previous instant. It is shown that this recursive GSVD updating scheme drastically reduces the computational complexity of the algorithm, making it amenable to practical implementation. Various issues related to its implementation are addressed. This thesis also explores in detail the possibility of further reducing computational complexity by incorporating the GSVD-based optimal filtering algorithm in a Generalized Sidelobe Canceller (GSC)-type structure, without causing any performance degradation in terms of background noise reduction and speech quality.