Theses and Dissertations
Permanent URI for this collection: http://ir.daiict.ac.in/handle/123456789/1
11 results
Search Results
Item Open Access
Analysis of voice biometric attacks: detection of synthetic vs natural speech (Dhirubhai Ambani Institute of Information and Communication Technology, 2014) S, Adarsa; Patil, Hemant A.

The improvement in text-to-speech (TTS) synthesis also poses the problem of biometric attacks on speaker verification systems. In this context, it is required to analyse the performance of these systems in terms of the false acceptance rate for impostors using artificial speech, and to incorporate features that make them robust to such attacks. The aim of this study is to understand different aspects of speech and hence extract appropriate features for distinguishing natural from synthetic speech. The study focuses on understanding those aspects which give naturalness to human speech and which present-day TTS systems fail to capture. Three different aspects, viz., Fourier transform phase, nonlinearity and speech prosody, are analysed. The performance of each feature is evaluated and a comparative study of the features is presented. The results obtained provide an evaluation of the naturalness of the synthetic speech used and provide features to improve robustness against biometric attacks in speaker verification systems.

Item Open Access
Vocal tract length normalization for automatic speech recognition (Dhirubhai Ambani Institute of Information and Communication Technology, 2014) Sharma, Shubham; Patil, Hemant A.

Various factors affect the performance of Automatic Speech Recognition (ASR) systems. In this thesis, speaker differences due to variations in vocal tract length (VTL) are taken into account. Vocal Tract Length Normalization (VTLN) has become an integral part of ASR systems these days. Different methods have been studied to compensate for these differences in the spectral domain. In this thesis, various state-of-the-art methods have been implemented and discussed in detail. For example, the method of Lee and Rose uses a maximum likelihood-based approach.
It implements a grid search over a range of warping factors to obtain the optimal warping factor for each speaker. On the other hand, the method of Umesh et al. uses the scale transform to obtain VTL-normalized features. Frequency warping is the basis of such normalization techniques. Mel scale warping is the most widely accepted for compensating speaker differences, as it is inspired by the hearing process of the human ear. Use of Bark scale-based warping is proposed in this thesis. The Bark scale is based on the perception of loudness by the human ear, in contrast with the Mel scale, which is based on pitch perception. Bark scale-based warping provides improved recognition accuracy under mismatched conditions (i.e., training on male (or female) speakers and testing on female (or male) speakers). The performances of the different methods have been tested on ASR tasks in English, Gujarati and Marathi. The TIMIT database is used for English, and the details of database collection for Gujarati and Marathi are discussed. VTLN has shown improvement over state-of-the-art MFCC features alone for almost all applications considered in this thesis. One of the major tasks in this thesis is the development of Phonetic Engines (PE) using VTLN in three different modes of speech, viz., read, spontaneous and lecture mode, in Gujarati and Marathi. The Lee-Rose method is used for the design of the PEs. Improved accuracy is achieved using the VTLN-based method as compared to MFCCs. In addition, a template matching experiment is performed using the various VTL-normalized features under study and MFCCs for the application of spoken keyword spotting. Better precision and lower equal error rates (EER) are obtained using VTL-normalized Scale Transform Cepstral Coefficients (STCC).
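The grid search over warping factors described above can be sketched in a few lines. This is a hedged illustration, not the thesis's implementation: the piecewise-linear warp parameterization, the knee at 0.875 of the maximum frequency, the grid range 0.88 to 1.12, and the toy likelihood function are all assumptions made for the example.

```python
def warp_freq(f, alpha, f_max=8000.0, knee_frac=0.875):
    """Piecewise-linear VTLN warp (hypothetical parameterization): scale
    frequencies by alpha below the knee, then join linearly so that
    f_max maps to f_max (keeping the warped axis within the band)."""
    knee = knee_frac * f_max
    if f <= knee:
        return alpha * f
    slope = (f_max - alpha * knee) / (f_max - knee)
    return alpha * knee + slope * (f - knee)

def best_warp(loglik, alphas=None):
    """Grid search: return the warp factor maximizing a per-speaker
    log-likelihood function (e.g., under a reference acoustic model)."""
    if alphas is None:
        alphas = [0.88 + 0.02 * k for k in range(13)]   # 0.88 .. 1.12
    return max(alphas, key=loglik)

# Toy stand-in likelihood that peaks at alpha = 1.04.
alpha_star = best_warp(lambda a: -(a - 1.04) ** 2)
```

In a real Lee-Rose-style system, `loglik` would evaluate the likelihood of warped cepstral features against a trained model, and the chosen factor would be applied to the filterbank before feature extraction.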
This suggests that VTLN-based features can be useful for larger applications such as audio search and spoken term detection (STD).

Item Open Access
Design of syllable-based speech segmentation methods for text-to-speech (TTS) synthesis system for Gujarati (Dhirubhai Ambani Institute of Information and Communication Technology, 2013) Talesara, Swati; Patil, Hemant A.

A text-to-speech (TTS) synthesizer has proved to be an aiding tool for many visually challenged people, enabling reading through hearing feedback. Although TTS synthesizers are available in English and other languages, it has been observed that people feel more comfortable learning in their own native language. Keeping this in mind, a Gujarati TTS synthesizer has been built, and the building process is described in this thesis. The TTS system has been built in the Festival speech synthesis framework. The syllable is taken as the basic speech sound unit in building the Gujarati TTS synthesizer, as Indian languages are syllabic in nature. However, building this unit-selection-based Gujarati TTS system requires a large labeled Gujarati corpus, and labeling is the most time-consuming and tedious task, requiring large manual effort. In this thesis, an attempt has been made to reduce this manual effort by automatically generating a labeled corpus at the syllable level. To that end, a Gaussian-based segmentation method has been proposed for automatic segmentation of speech at the syllable level. It has been observed that the percentage correctness of the labeled data is around 80 % for both male and female voices, compared to 70 % for group delay-based labeling. In addition, the system built on the proposed approach shows better intelligibility when evaluated by a visually challenged subject. The word error rate is reduced by 5 % for the Gaussian filter-based TTS system, compared to the group delay-based TTS system.
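The idea behind Gaussian filter-based segmentation can be sketched as below. This is an illustration under assumptions, not the thesis's actual method: it assumes a short-term energy contour smoothed with a Gaussian kernel, with local minima taken as candidate syllable boundaries, and the toy contour is synthetic.

```python
import math

def gaussian_kernel(sigma):
    """Normalized Gaussian kernel truncated at +/- 3 sigma."""
    radius = max(1, int(3 * sigma))
    k = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(k)
    return [v / total for v in k]

def smooth(x, sigma):
    """Smooth a short-term energy contour with a Gaussian filter (edges clamped)."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    return [sum(w * x[min(max(n + i - r, 0), len(x) - 1)]
                for i, w in enumerate(k))
            for n in range(len(x))]

def valleys(x):
    """Local minima of the smoothed contour: candidate syllable boundaries."""
    return [i for i in range(1, len(x) - 1)
            if x[i] < x[i - 1] and x[i] <= x[i + 1]]

# Toy contour: two energy humps (two "syllables") with a dip between them.
energy = [math.exp(-((n - 5) ** 2) / 8.0) + math.exp(-((n - 15) ** 2) / 8.0)
          for n in range(21)]
boundaries = valleys(smooth(energy, 1.0))
```

On the toy contour, the single detected valley falls at the dip between the two humps, which is the behaviour a syllable segmenter wants from real energy contours.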
Furthermore, a 5 % increase is observed in correctly synthesized words. The main focus of this thesis has been to reduce the manual effort required in building a TTS system for Gujarati (i.e., the manual effort required in labeling speech data).

Item Open Access
Studies on transcription, classification and detection of obstruents (Dhirubhai Ambani Institute of Information and Communication Technology, 2013) Malde, Kewal Dhiraj; Patil, Hemant A.

Speech is a powerful mode of communication among people. During the last few decades, there has been growing interest in speech-related research all over the world. For algorithms in automatic speech recognition (ASR) systems, independence from language and accent is one of the important requirements. Hence, ASR based on automatic phonetic transcription (which is independent of language, accent and speaker) is a better idea. In this thesis, two objectives, viz., obstruent classification and obstruent detection, are explored. In order to clarify the basic concepts related to obstruents (i.e., consonants which are produced by obstructing the airflow either completely or partially), various aspects of speech are discussed in detail. As a database is one of the important requirements for any experiment, details of the standard speech database (viz., TIMIT, used by most researchers all over the globe) are discussed. In addition, details of the speech corpora being developed by our (DA-IICT Prosody research) team in two Indian languages (viz., Gujarati and Marathi) are also discussed. Phonetic transcription, being the core application of the research work done in this thesis, is given special importance and explained in detail. All the experiments are performed on the TIMIT database as well as on the speech databases in the two Indian languages (viz., Gujarati and Marathi) being developed by the DA-IICT DeitY Prosody research team.
Experiments for the obstruent classification task are performed using modulation spectrogram-based features. Experiments for obstruent detection are performed using three methods, based on the STM contour, chaotic titration and Seneff's auditory model. We get more consistent classification results (i.e., classification of obstruents, stops and fricatives) using the TIMIT database than using our own database; the reason is that the TIMIT database has been developed by expert phoneticians, whereas ours has not. Also, as our database is still under development, it contains fewer phoneme samples (i.e., less phonetically transcribed data). Due to the smaller amount of transcribed data (from our own database in Gujarati and Marathi) along with less variation in accent across speakers, the classification results obtained are better for our own database than for the TIMIT database (wherein there is large variation in accent across speakers). We obtained good classification accuracy (i.e., around 90-99 %) using an optimum feature size (modulation spectrogram-based features obtained after feature reduction using HOSVD) and a 75:25 % training-to-testing ratio. For the obstruent detection task, we obtained 77 %, 99.61 % and 97.61 % detection efficiency using the methods based on the STM contour, chaotic titration and Seneff's auditory model (SAM), respectively. The results obtained using the latter two methods are better than those of the STM contour, at the cost of a decrease in estimated probability. From the present work, we can say that modulation spectrogram-based features can be a good option for obstruent classification; however, dimension reduction methods are needed to reduce the size of the feature vector obtained from the modulation spectrogram. As the present classification work was based on isolated obstruent speech segments, it can be extended to continuous speech.
Seneff’s auditory model, with certain modifications and proper selection of parametric constants, can be used for the obstruent detection task.

Item Open Access
Phonetic segmentation: unsupervised approach (Dhirubhai Ambani Institute of Information and Communication Technology, 2013) Vachhani, Bhavikkumar Bhagvanbhai; Patil, Hemant A.

Phonetic segmentation finds potential application in Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) synthesis systems. In this thesis, we propose the use of different spectral features, viz., Mel Frequency Cepstral Coefficients (MFCC), Cochlear Filter Cepstral Coefficients (CFCC) and Perceptual Linear Prediction Cepstral Coefficients (PLPCC), to compute a spectral transition measure (STM) for automatic detection of phonetic boundaries. We propose a new unsupervised algorithm combining evidence from the state-of-the-art MFCC and the proposed CFCC to improve the accuracy of the automatic phonetic boundary detection process. Using the proposed fusion-based approach, we achieve 90 % accuracy (i.e., 8 % better than MFCC-based STM alone for a 20 ms tolerance interval) for automatic boundary detection over the entire TIMIT database. Using the proposed PLPCC-based STM approach, we achieve 85 % accuracy (i.e., 3 % better than state-of-the-art MFCC-based STM for a 20 ms tolerance interval) and a 15 % over-segmentation rate (i.e., 8 % less than MFCC-based STM) for automatic detection of the 2,34,925 phone boundaries corresponding to the 630 speakers of the entire TIMIT database.
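The spectral transition measure used in these boundary detectors can be sketched as follows. A common formulation, assumed here for illustration only, scores each frame by the mean squared regression slope of its feature trajectories over a small window; the toy 1-D "cepstral" track below stands in for real MFCC/CFCC/PLPCC frames.

```python
def stm(feats, half_win=2):
    """Spectral transition measure per frame: the mean over feature
    dimensions of the squared least-squares slope of each trajectory
    over a +/- half_win frame window (edges clamped)."""
    n, dim = len(feats), len(feats[0])
    denom = sum(k * k for k in range(-half_win, half_win + 1))
    out = []
    for m in range(n):
        acc = 0.0
        for d in range(dim):
            num = 0.0
            for k in range(-half_win, half_win + 1):
                idx = min(max(m + k, 0), n - 1)
                num += k * feats[idx][d]
            acc += (num / denom) ** 2
        out.append(acc / dim)
    return out

def peaks(x, thr):
    """Local maxima above a threshold: candidate phone boundaries."""
    return [i for i in range(1, len(x) - 1)
            if x[i] > thr and x[i] >= x[i - 1] and x[i] >= x[i + 1]]

# Toy 1-D track with an abrupt spectral change between frames 4 and 5.
feats = [[0.0]] * 5 + [[1.0]] * 5
scores = stm(feats)
boundaries = peaks(scores, 0.05)
```

The STM contour peaks exactly where the features change fastest, which is why thresholded peak picking on it yields phone boundary candidates.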
The second part of the thesis focuses on the development of various applications using the automatically segmented and labeled boundaries.

Item Open Access
Objective evaluation of speech quality of text-to-speech (TTS) synthesis systems (Dhirubhai Ambani Institute of Information and Communication Technology, 2013) Sailor, Hardik Bhupendra; Patil, Hemant A.

Since the use of Text-to-Speech (TTS) technology is increasing, there is a high demand for TTS systems that can produce natural and intelligible voices in any environment. In order to improve a speech synthesis system, the synthesized speech must be properly evaluated, so that the gap between natural and synthetic speech can be identified and addressed by developing proper methods in each modelling block of the TTS system. This thesis addresses the machine evaluation approach, known as the objective method, for speech quality measurement of TTS voices. In this thesis, both conventional techniques for evaluating the speech quality of TTS voices and recently proposed techniques are used. It is shown that conventional techniques like PESQ and spectrogram analysis are not able to capture cues related to speech naturalness. Also, experimental results show that distance-based objective measures using perceptual features, viz., Perceptual Cepstral Distance (PCD), are not appropriate for speech quality evaluation of TTS voices. In order to assess the naturalness of synthetic speech, a recently proposed method based on pitch (i.e., F0) information in the speech signal is used. Since the human speech production model is difficult to apply in speech synthesis systems, pitch or fundamental frequency (F0)-related features are used, and their direct correlation with subjective scores is obtained. The results on the Blizzard Challenge speech database show the potential of these features, with a correlation coefficient of 0.59; however, this still needs to be improved.
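The correlation coefficients quoted in this line of work are Pearson correlations between an objective measure and subjective scores. A self-contained sketch, with made-up toy numbers rather than the thesis's data, shows how a distortion measure that falls as perceived quality rises yields a strong negative coefficient:

```python
import math

def pearson(x, y):
    """Pearson correlation between an objective measure and subjective scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy data: a distortion-style measure (lower is better) against MOS-style
# listener scores (higher is better), so the correlation should be negative.
mos = [2.0, 3.0, 3.5, 4.0, 4.5]
distortion = [9.0, 7.5, 7.0, 6.0, 5.5]
r = pearson(distortion, mos)
```

This is why a measure like MCD correlating at -0.77 with quality scores counts as a good result: the sign is expected for a distortion measure, and the magnitude indicates how well it tracks listeners.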
For speech intelligibility, a simple phone recognition method is developed in this thesis, and experiments on CMU ARCTIC data show a good correlation coefficient of -0.77 with the MCD measure, a commonly used measure of speech quality in TTS. As part of the TTS team at DA-IICT, a TTS system in Gujarati is developed so that users can communicate with the machine in their native language. All the objective measures discussed in this thesis are applied and compared with subjective scores. Based on the experiments, it is evident that objective measures are applicable only to Statistical Parametric Speech Synthesis (SPSS) systems and related technologies, since in unit-selection-based TTS the speech output is a concatenation of natural speech sound units.

Item Open Access
Person recognition using humming, singing and speech (Dhirubhai Ambani Institute of Information and Communication Technology, 2013) Chhayani, Nirav Hitendrabhai; Patil, Hemant A.

In this thesis, a person recognition system is designed for three different speech-related biometric signals, i.e., humming, singing and normal speech. As humming is a nasalized sound, we have used Mel filterbank-based features for the person recognition task rather than the LP (Linear Prediction) model. This thesis work observes which of the three biometric patterns performs best for the person recognition task. Since person-specific information is not the same in any two biometric signals, the performance of each should be observed. The first task for any person recognition system design is data collection and corpus design. Hence, in this thesis, a corpus is first designed for humming, singing and speech. In the data collection part, a total of 50 subjects were selected for recording. The data collection was done in 4 different sessions for each subject in order to capture intersession variability. Each session consists of recordings of humming, singing and speech.
After data collection, feature extraction is done with a Mel filterbank, which follows human auditory perception; hence Mel Frequency Cepstral Coefficients (MFCC) are used as the state-of-the-art feature set. Using this filterbank, experiments are performed for intersession as well as same-session training-testing sets. After that, noise is added to the database and the results are compared to observe the effect of noise, i.e., to evaluate the robustness of the system under noisy conditions. A modification is then also made in the feature extraction process using the TEO (Teager Energy Operator), giving Teager Energy-based MFCC (T-MFCC). The results for the T-MFCC feature set are compared with those for the MFCC feature set. Score-level fusion of the T-MFCC and MFCC feature sets is also done and the results are observed. These observations lead to the finding that score-level fusion of MFCC and T-MFCC performs better than either of the two individually; this type of score-level fusion increases the performance of the system. Performance is measured for different values of the fusion weight, and the optimum fusion weight is determined for the humming, singing and speech signals. The effects of the feature dimension as well as the order of the classifier are also observed for the intersession experiment. After these studies, an inter-biometric-type experiment is performed. Based on the results obtained in this experiment, Fisher's F-ratio is determined for all three biometric patterns (i.e., humming, singing and speech). A new filterbank structure is proposed for all three biometric patterns. The system performance is also measured for this new filterbank and compared with all previous experiments. In all these experiments, the person-specific model is generated using a polynomial classifier, which considers out-of-class information while creating the person-specific model. The experiments are reported for different performance evaluation factors.
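Score-level fusion with a swept fusion weight, as described above, can be sketched as follows. This is an illustrative toy rather than the thesis's system: the scores, labels and accuracy criterion are invented, and a real system would sweep the weight against EER on development data rather than raw accuracy.

```python
def fuse(s_a, s_b, w):
    """Weighted score-level fusion: weight w on the first score stream,
    (1 - w) on the second."""
    return [w * a + (1 - w) * b for a, b in zip(s_a, s_b)]

def accuracy(scores, labels, thr=0.0):
    """Fraction of trials where the thresholded score matches the
    genuine (True) / impostor (False) label."""
    return sum((s > thr) == lab for s, lab in zip(scores, labels)) / len(scores)

def best_weight(s_a, s_b, labels, steps=20):
    """Sweep the fusion weight over a grid and keep the best performer."""
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda w: accuracy(fuse(s_a, s_b, w), labels))

# Toy trials: each stream alone makes one error; fusion corrects both,
# illustrating why fusing complementary feature sets can help.
labels = [True, True, False, False]
s_mfcc = [1.0, -0.2, -1.0, -0.5]     # rejects the second genuine trial
s_tmfcc = [0.8, 1.0, 0.4, -1.0]      # falsely accepts the first impostor
w_opt = best_weight(s_mfcc, s_tmfcc, labels)
```

On the toy trials each individual stream scores 75 %, while the fused scores at the selected weight classify every trial correctly.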
For example, the effects of the polynomial classifier order, the dimension of the feature vector and noisy environments are considered. To evaluate performance, DET curves are used; this is a NIST-standardized, widely accepted performance evaluation measure for speaker recognition applications.

Item Open Access
Feature based approach for singer identification (Dhirubhai Ambani Institute of Information and Communication Technology, 2012) Radadia, Purushotam G.; Patil, Hemant A.

One of the challenging and difficult problems under the category of Music Information Retrieval (MIR) is to identify the singer of a given song under strong instrumental accompaniment. Besides instrumental sounds, other factors also severely affect Singer IDentification (SID) accuracy, such as the quality of the song recording devices, transmission channels and other singing voices present within a song. In our work, we perform singer identification with a large database of 500 songs (the largest database ever used for the SID problem) prepared from Hindi (Indian language) Bollywood songs. In addition, vocal portions are segmented manually from each of the songs. Different features have been employed in this thesis in addition to the state-of-the-art feature set, Mel Frequency Cepstral Coefficients (MFCC). To identify a singer, three classifiers are employed, viz., a 2nd order polynomial classifier, a 3rd order polynomial classifier and the state-of-the-art GMM classifier. Furthermore, to alleviate the effect of recording devices and transmission channels, the Cepstral Mean Subtraction (CMS) technique on MFCC is utilized for singer identification, and it provides better results than the baseline MFCC alone. Moreover, the 3rd order classifier outperforms the other two classifiers.
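Cepstral Mean Subtraction itself is a one-step normalization. A minimal sketch, using toy two-frame "cepstra" rather than real MFCCs and a hypothetical channel offset, shows why it cancels stationary recording-device and channel effects: a constant convolutive channel becomes an additive constant in the cepstral domain, which the per-utterance mean removes.

```python
def cms(cepstra):
    """Cepstral mean subtraction: subtract the per-utterance mean of each
    cepstral coefficient, cancelling stationary convolutive channel effects."""
    n = len(cepstra)
    dim = len(cepstra[0])
    mean = [sum(frame[d] for frame in cepstra) / n for d in range(dim)]
    return [[frame[d] - mean[d] for d in range(dim)] for frame in cepstra]

# Toy check: the same utterance seen through two channels (a constant
# cepstral offset, hypothetical values) normalizes to identical features.
clean = [[1.0, 2.0], [3.0, 4.0]]
offset = [0.5, -1.0]
channeled = [[c + o for c, o in zip(frame, offset)] for frame in clean]
```

After `cms`, the clean and channeled versions are identical, which is the property that makes CMS-MFCC more robust than raw MFCC across recording conditions.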
Score-level fusion of MFCC and CMS-MFCC is also used in this thesis, and it improves the results significantly.

Item Open Access
Speaker recognition over VoIP network (Dhirubhai Ambani Institute of Information and Communication Technology, 2011) Goswami, Parth A.; Patil, Hemant A.

This thesis deals with Automatic Speaker Recognition (ASR) systems over narrowband Voice over Internet Protocol (VoIP) networks. There are several artifacts of VoIP networks, such as speech codecs, packet loss, packet re-ordering, network jitter and echo. In this thesis, packet loss is taken as the research issue, in order to investigate the performance degradation of an ASR system due to packet loss. As voice packets travel over an Internet Protocol (IP) network, they tend to take different routes. Some of them are dropped by the channel due to congestion, and some are rejected by the receiver. This packet loss reduces the perceptual quality of speech. Therefore, it is natural to expect that packet loss may affect the performance of an ASR system. To alleviate this degradation in ASR performance due to packet loss, novel interleaving schemes and a lossy training method are proposed. It is shown in the present work that these interleaving schemes and lossy training methods significantly help in improving the performance of an ASR system.

Item Open Access
Person recognition from their hum (Dhirubhai Ambani Institute of Information and Communication Technology, 2011) Madhavi, Maulik C.; Patil, Hemant A.

In this thesis, the design of a person recognition system based on a person's hum is presented. As a hum is a nasalized sound, and the LP (Linear Prediction) model does not characterize nasal sounds sufficiently, our approach in this work is based on Mel filterbank-based cepstral features for the person recognition task. The first task consisted of data collection and corpus design for humming. For this purpose, hummings of old Hindi songs from around 170 subjects are used.
Then feature extraction schemes were developed. The Mel filterbank follows human auditory perception, so MFCC was used as the state-of-the-art feature set. Some modifications in the filterbank structure were then made in order to compute the Gaussian Mel scale-based MFCC (GMFCC) and Inverse Mel scale-based MFCC (IMFCC) feature sets. In this thesis, two features are mainly explored. The first feature set captures phase information via MFCC utilizing the VTEO (Variable-length Teager Energy Operator) in the time domain, i.e., MFCC-VTMP, and the second captures vocal-source information, called Variable-length Teager Energy Operator-based MFCC, i.e., VTMFCC. The proposed feature set MFCC-VTMP has two characteristics, viz., it captures phase information and it exploits the properties of the VTEO. The VTEO is an extension of the TEO, a nonlinear energy tracking operator. Feature sets like VTMFCC capture vocal-source information, which reflects the excitation mechanism in the speech (hum) production process and is found to be complementary to the vocal tract information. Hence, a score-level fusion-based approach combining different source and system features improves person recognition performance.
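The Teager Energy Operator and its variable-length extension mentioned above can be written in a few lines. The sinusoid check below is a standard property of the TEO; the exact VTEO definition used in the thesis may differ from this common lag-m form, which is assumed here for illustration.

```python
import math

def teo(x):
    """Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def vteo(x, m):
    """Variable-length TEO: lag m generalizes the TEO's unit lag (m = 1)."""
    return [x[n] ** 2 - x[n - m] * x[n + m] for n in range(m, len(x) - m)]

# For a unit-amplitude sinusoid of digital frequency omega, the TEO output
# is the constant sin(omega)^2: it tracks the oscillation's "energy"
# (amplitude and frequency together) sample by sample.
omega = 0.3
x = [math.cos(omega * n) for n in range(50)]
energies = teo(x)
```

This instantaneous, nonlinear energy estimate is what TEO/VTEO-based features feed into the Mel filterbank pipeline in place of (or alongside) the usual squared-magnitude energy.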