Theses and Dissertations

Automatic Speaker Verification (ASV) systems are vulnerable to various spoofing attacks, namely, Speech Synthesis (SS), Voice Conversion (VC), Replay, and Impersonation. The study of spoofing countermeasures has become a critical area of research and is the principal objective of this thesis. With the development of neural network-based techniques, in particular for machine-generated spoof speech signals, the Spoof Speech Detection (SSD) task will become even more challenging. To encourage the development of countermeasures based on signal processing techniques or neural network-based features for the SSD task, standardized datasets were provided by the organizers of the ASVspoof challenge campaigns in 2015, 2017, and 2019. The front-end features extracted from the speech signal have a large impact in signal processing applications. The goal of feature extraction is to estimate, directly from the speech signal, the meaningful information that can help the pattern classifier in tasks such as speech, speaker, and emotion recognition. Among the various spoofing attacks, speech synthesis, voice conversion, and replay attacks have been identified as the most effective and accessible forms of spoofing. Accordingly, this thesis investigates and develops a framework to extract discriminative features to counter these three spoofing attacks. The main contribution of the thesis is to propose various feature sets as front-end countermeasures for the SSD task using a traditional Gaussian Mixture Model (GMM)-based classification system.
The feature sets are based on the Teager Energy Operator (TEO) and the Energy Separation Algorithm (ESA), namely, Teager Energy Cepstral Coefficients (TECC), Energy Separation Algorithm Instantaneous Frequency Cepstral Coefficients (ESA-IFCC), Energy Separation Algorithm Instantaneous Amplitude Cepstral Coefficients (ESA-IACC), Amplitude Weighted Frequency Cepstral Coefficients (AWFCC), and Gabor Teager Filterbank (GTFB). The motivation for using the TEO is that it captures nonlinear properties of speech production. The TEO is known to estimate the true total source energy, and it preserves the amplitude and frequency modulation of a resonant signal; hence, it improves both the time-frequency resolution and the representation of formant information. In addition, the TEO has a noise suppression property and attempts to remove the distortion caused by a noise signal. In Chapter 3, we analyze the replay speech signal in terms of the reverberation that occurs during recording of the speech signal. Reverberation introduces delays and amplitude changes, producing close copies of the speech signal that significantly influence the replay components. To that effect, we propose to exploit the capabilities of the TEO to compute a running estimate of subband energies for replay vs. genuine signals. We have used a linearly-spaced Gabor filterbank to obtain narrowband filtered signals. The TEO has the property of tracking the instantaneous changes of a signal. In Chapter 4, we propose Instantaneous Amplitude (IA) and Instantaneous Frequency (IF) features using the ESA. The speech signal is passed through bandpass filters in order to obtain narrowband components, because speech is a combination of several monocomponent signals. To obtain narrowband filtered signals, we have used linearly-spaced Butterworth and Gabor filterbanks. The instantaneous modulations help to understand the local characteristics of a non-stationary signal.
These IA and IF components are able to capture the information present in the slowly-varying amplitude envelope and the fast-varying frequency. For replay speech, the slowly-varying temporal modulations have a distorted amplitude envelope, and the fast-varying temporal modulations do not preserve the harmonic structure compared to the natural speech signal. For a replay speech signal, the intermediate device characteristics and the acoustic environment distort the spectral energy compared to the natural speech energy. In Chapter 5, we extend our earlier work with the generalized TEO, i.e., by varying the past and future sample instants with a constant arbitrary integer k, also known as the lag parameter or dependency index; we name this operator the Variable length Teager Energy Operator (VTEO). In Chapter 6, we propose the combination of Amplitude Modulation and Frequency Modulation (AM-FM) features for the replay SSD task. The AM components are known to be affected by noise (in this case, due to the replay mechanism). In particular, we explore how this damage in the AM component relates to the corresponding Instantaneous Frequency (IF) for the SSD task. Thus, the novelty of the proposed Amplitude Weighted Frequency Cepstral Coefficients (AWFCC) feature set lies in using frequency components along with squared weighted amplitude components that are degraded due to replay noise. The AWFCC features contain the information of both the AM and FM components together and hence give discriminatory information in the spectral characteristics. The first motivation of this thesis is to develop various countermeasures for the SSD task. The experimental results on the standard spoofing databases show that the proposed feature sets perform better than the corresponding baseline systems.
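As an illustration only, the generalized operator with lag parameter k described for Chapter 5 can be sketched as below. The abstract does not give the formula, so this assumes the commonly used discrete form x[n]^2 - x[n-k] * x[n+k], which reduces to the standard TEO for k = 1. The function name `vteo` and the choice to drop the k edge samples on each side are assumptions made for this sketch, not the thesis implementation.

```python
import numpy as np

def vteo(x, k=1):
    """Variable length TEO sketch: psi_k[n] = x[n]^2 - x[n-k] * x[n+k].

    k is the lag parameter (dependency index); k = 1 gives the standard TEO.
    The first and last k samples are dropped because their neighbours
    fall outside the signal (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float)
    if k < 1 or x.size <= 2 * k:
        raise ValueError("k must be >= 1 and the signal longer than 2 * k")
    return x[k:-k] ** 2 - x[:-2 * k] * x[2 * k:]

# Hypothetical test tone: amplitude 2.0, frequency 0.3 rad/sample.
n = np.arange(200)
tone = 2.0 * np.cos(0.3 * n + 0.1)
energy_k1 = vteo(tone, k=1)  # standard TEO
energy_k3 = vteo(tone, k=3)  # generalized operator with lag 3
```

For a pure tone A cos(Ωn + φ), this form yields the constant value A^2 sin^2(kΩ), so the operator responds to both amplitude and frequency, consistent with the motivation stated above.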
Inspired by the success in the SSD task, we applied the TEO-based feature set in a variety of speech and audio processing applications, namely, Automatic Speech Recognition (ASR), Acoustic Scene Sound Classification (ASC), Voice Assistant (VA), and Whisper Speech Detection (WSD). In all these applications, our TEO-based feature set gave consistently better performance than the respective baselines.
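The energy separation step behind the IA and IF features can be illustrated with a small sketch. The abstract does not state which ESA variant is used, so the DESA-2 form is assumed here. The function names `teo` and `desa2`, the `eps` guard, and the test tone are hypothetical choices for this example. In practice the input would be one narrowband output of the Butterworth or Gabor filterbank rather than a synthetic tone.

```python
import numpy as np

def teo(x):
    """Discrete TEO: psi[n] = x[n]^2 - x[n-1] * x[n+1] (edge samples dropped)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x, eps=1e-12):
    """DESA-2 sketch: returns (instantaneous amplitude, frequency in rad/sample).

    Assumes x is a narrowband (roughly monocomponent) signal.
    Two samples are lost at each end of the signal.
    """
    x = np.asarray(x, dtype=float)
    if x.size < 5:
        raise ValueError("need at least 5 samples")
    y = x[2:] - x[:-2]              # symmetric difference x[n+1] - x[n-1]
    psi_x = teo(x)[1:-1]            # trimmed to align with psi_y
    psi_y = teo(y)
    # Guard (assumption of this sketch): the TEO can be <= 0 on real speech.
    psi_x = np.maximum(psi_x, eps)
    psi_y = np.maximum(psi_y, eps)
    cos_arg = np.clip(1.0 - psi_y / (2.0 * psi_x), -1.0, 1.0)
    inst_freq = 0.5 * np.arccos(cos_arg)
    inst_amp = 2.0 * psi_x / np.sqrt(psi_y)
    return inst_amp, inst_freq

# Hypothetical test tone: amplitude 1.5, frequency 0.4 rad/sample.
n = np.arange(256)
tone = 1.5 * np.cos(0.4 * n + 0.2)
amp, freq = desa2(tone)
```

On a pure tone the estimates recover the amplitude and frequency of the tone at every retained sample. On filtered speech they vary over time, and it is these time-varying IA and IF tracks that the cepstral features summarize.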

Automatic Speaker Verification (ASV) systems and Voice Assistants (VAs) are highly vulnerable to spoofing attacks. Spoofing refers to an intentional circumvention wherein an imposter tries to manipulate a biometric system simply by masquerading as another genuinely enrolled person. ASV systems are vulnerable to five kinds of spoofing attacks, namely, Speech Synthesis (SS), Voice Conversion (VC), Impersonation, Twins, and Replay. A replay attack on a voice biometric system refers to a fraudulent attempt by an imposter to spoof another person’s identity by replaying pre-recorded voice samples in front of an ASV system. Among all the spoofing attacks, the replay attack is the simplest to execute but hard to detect. In particular, a replay attack on an ASV system or VA carried out with a high-quality recording and playback device is very hard to detect, because the replayed speech is very similar to that of the genuine speaker. Given the vulnerability of ASV and VA systems to replay spoofing attacks, this thesis aims at developing effective countermeasures to protect these systems from such malicious attempts. In this thesis, five novel feature sets are developed for the replay spoof detection task. Of these five, the first three, namely, Cochlear Filter Cepstral Coefficients Instantaneous Frequency using Energy Separation Algorithm (CFCCIF-ESA), Enhanced Teager Energy Cepstral Coefficients (ETECC), and u-vector, are used for replay detection on ASV systems, whereas Cross-Teager Energy Cepstral Coefficients (CTECC) and Spectral Root Cepstral Coefficients (SRCC) are used for replay detection on VA systems. The performance of the proposed feature sets is evaluated using two datasets, namely, the ASVspoof 2017 version 2.0 dataset for replay detection on ASV systems, and the Realistic Replay Attack Microphone Array Speech Corpus (ReMASC) for replay detection on VA systems.
Results obtained are compared against the baseline Constant Q Cepstral Coefficients (CQCC), Linear Frequency Cepstral Coefficients (LFCC), and state-of-the-art Mel Frequency Cepstral Coefficients (MFCC) feature sets.
