Theses and Dissertations

Permanent URI for this collection: http://ir.daiict.ac.in/handle/123456789/1

Search Results

  • Item (Open Access)
    Feature for Live and Spoofed Speech Detection
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2023) Gupta, Priyanka; Patil, Hemant A.
    The authorization to access specific information is given by a biometric system. Biometric systems are used for security purposes in a way that they prevent unauthorized access to important information or data (information privacy). The access granted by the biometric is done by capturing traits of humans, which make all human beings unique w.r.t. that particular trait. This thesis focuses on voice-based biometric systems, also known as Automatic Speaker Verification (ASV) systems, given that speech is the most natural and powerful form of communication used by humans to communicate with the outside world. It is the most intuitive, simple, and easy-to-produce characteristic. Since ASV systems have been used for applications, such as in banking transactions and access to buildings associated with classified information, only authorized legitimate or genuine users are granted access. ASV systems suffer from vulnerabilities to attacks and can be compromised at various stages. The attacks may be categorized as direct and indirect attacks, depending on the extent of the attacker's accessibility to the ASV framework. Besides, due to the recent commercial success of several Intelligent Personal Assistants (IPAs), also known as voice assistants, such as Speech Interpretation and Recognition Interface (SIRI), Amazon Alexa, Google Home, and so on, many voice-enabled devices in Internet of Things (IoT) have been commonly prone to spoofing attacks. To that effect, there is active research in the direction of designing countermeasure systems for ASV systems, particularly for spoofing attacks, namely, Speech Synthesis (SS), Voice Conversion (VC), and replay. This thesis is a humble attempt to alleviate some of the research gaps in designing features for countermeasure systems. In particular, this thesis proposes Quadrature Energy Separation Algorithm (QESA) in the light of incorporating the quadrature-phase component with the in-phase component of the signal.
    To that effect, an existing feature set for replay Spoofed Speech Detection (SSD), namely, CFCCIF-ESA, is extended to the CFCCIF-QESA feature set for enhanced performance of the countermeasure system. The performance of the proposed CFCCIF-QESA feature set is evaluated on various datasets for various spoofing attacks given in the literature. Furthermore, the existing Linear Frequency Residual Cepstral Coefficients (LFRCC) feature set is optimized w.r.t. its Linear Prediction (LP) order for the replay SSD task. In particular, it is found that the LP order needed for a good prediction of speech is not the same as that needed for the replay SSD task. The resulting optimized LFRCC feature set is evaluated on the ASVSpoof 2019 PA dataset. In addition to this, another feature, known as the uncertainty vector (u-vector), is developed from Heisenberg's uncertainty principle in the signal processing framework. The proposed u-vector is evaluated using the ASVSpoof 2017 dataset for replay attacks. Furthermore, toward making countermeasure systems independent of the type of spoofing attack, features have been proposed for the Voice Liveness Detection (VLD) task. VLD is performed by detecting pop noise, the discriminating acoustic cue present in live speech, which is produced by the breathing effect captured when the speaker's mouth is close to the microphone. The work on VLD in this thesis is based on two key hypotheses: the first is Parseval's energy equivalence for STFT, CWT, and analytic CWT, and the second is that the energy of pop noise decreases with the distance from the speaker of the microphone used to capture genuine speech.
    The proposed features for VLD in this thesis are wavelet-based, wherein three wavelets are used, namely, Bump, Morlet, and Morse wavelets, where the Morse wavelet is presented as a superfamily of analytic wavelets, called Generalized Morse Wavelets (GMWs). A detailed experimental analysis is presented, covering speaker-microphone proximity, the effect of phoneme type, and the effect of frequency range. Apart from this, the security of speech data is also taken into account, and this thesis proposes an improved Voice Privacy (VP) system, which is based on Linear Prediction (LP) of speech. Furthermore, the VP system is studied from the attacker's perspective using the target selection approach; in particular, target selection w.r.t. twins is studied, wherein the most vulnerable twin-pair (i.e., target) is selected. Lastly, some of the proposed feature sets in this thesis are also evaluated for tasks related to other Assistive Speech Technologies (AST) applications, such as the classification of healthy vs. pathological infant cries, and dysarthric severity-level classification.
  • Item (Open Access)
    Environmental Sound Classification (ESC) using Handcrafted and Learned Features
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2017) Agrawal, Dharmeshkumar Maheshchandra; Patil, Hemant A.
    Environmental Sound Classification (ESC) is an important research field due to its application in various fields, such as hearing aids and road surveillance systems for security and safety purposes. The ESC task was earlier done using the Mel Frequency Cepstral Coefficients (MFCC) feature set and a Gaussian Mixture Model (GMM) classifier. Recently, deep-learning-based approaches have been used for the ESC task, such as Convolutional Neural Network (CNN) based classification, which builds an end-to-end system for ESC on a CNN framework. The ESC task is quite challenging because environmental sounds span many categories that are difficult to classify. In this thesis, we propose two new and different feature sets for the ESC task, namely, a handcrafted feature set (i.e., a signal processing-based approach) and a data-driven feature set (i.e., a machine learning-based approach). In the handcrafted feature set, we propose to use a modified Gammatone filterbank with the Teager Energy Operator (TEO) for the ESC task. In this thesis, we have used two classifiers, namely, GMM using cepstral features, and CNN using spectral features. We performed experiments on two datasets, namely, ESC-50 and UrbanSound8K. We compared TEO-based coefficients with MFCC and Gammatone cepstral coefficients (GTCC), in which GTCC used mean square energy. The results show that score-level fusion of the proposed TEO-based Gammatone feature set and MFCC gave better performance than MFCC alone on both datasets using GMM and CNN classifiers. This shows that the proposed TEO-based Gammatone features contain complementary information, which is helpful in the ESC task. In the data-driven feature set, we use a Convolutional Restricted Boltzmann Machine (ConvRBM) to learn a filterbank from the raw audio signals. ConvRBM is a generative model trained in an unsupervised way to model audio signals of arbitrary lengths. ConvRBM is trained using the annealed dropout technique, and parameters are optimized using Adam optimization.
    The subband filters of ConvRBM learned from the ESC-50 database resemble a Fourier basis in the mid-frequency range, while some of the low-frequency subband filters resemble a Gammatone basis. We have used our proposed model as a front-end for the ESC task with a supervised CNN as a back-end.
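
The Teager Energy Operator (TEO) named in the abstract above has a standard discrete-time form, psi[x(n)] = x(n)^2 - x(n-1) * x(n+1). The following is a minimal sketch of that operator alone, not of the feature pipeline used in either thesis. The function name `teager_energy` and the test-tone parameters are illustrative assumptions.

```python
import math


def teager_energy(x):
    """Discrete Teager Energy Operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns len(x) - 2 values, because the first and last samples
    do not have both neighbours. Inputs shorter than 3 samples give [].
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


# Illustrative test tone (hypothetical values, not taken from the theses).
amplitude, omega = 2.0, 0.3  # omega is in radians per sample
tone = [amplitude * math.cos(omega * n) for n in range(64)]
psi = teager_energy(tone)

# For a pure sinusoid the operator output is constant: A^2 * sin^2(omega).
expected = (amplitude * math.sin(omega)) ** 2
max_error = max(abs(v - expected) for v in psi)
print("max deviation from A^2 sin^2(omega):", max_error)
```

For a sinusoid A cos(omega n + phi), the output equals A^2 sin^2(omega), which is close to A^2 omega^2 for small omega. The operator therefore responds to both amplitude and frequency, which is the property that energy-separation methods build on.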