M Tech (EC) Dissertations

Permanent URI for this collection: http://ir.daiict.ac.in/handle/123456789/6

  • Item (Open Access)
    Phase Based Methods for Various Speech Applications
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2023) Pusuluri, Aditya; Patil, Hemant A.
    Vocal communication plays a fundamental role in human interaction and expression. Right from the first cry to adult speech, the signal conveys information about the well-being of the individual. Lack of coordination between the speech muscles and the brain leads to voice pathologies. Some pathologies related to infants are asphyxia, Sudden Infant Death Syndrome (SIDS), etc. Other voice pathologies that affect the speech production system are dysarthria, cerebral palsy, and Parkinson's disease.
    Dysarthria, a neurological motor speech disorder, is characterized by impaired speech intelligibility that can vary across severity-levels. This work focuses on exploring the importance of Modified Group Delay Cepstral Coefficients (MGDCC)-based features in capturing the distinctive acoustic characteristics associated with dysarthric severity-level classification, particularly for irregularities in speech. A Convolutional Neural Network (CNN) and a traditional Gaussian Mixture Model (GMM) are used as the classification models in this study. MGDCC is compared with state-of-the-art magnitude-based features, namely, Mel Frequency Cepstral Coefficients (MFCC) and Linear Frequency Cepstral Coefficients (LFCC). In addition, this work also analyses the noise robustness of MGDCC. To that effect, experiments were performed on various noise types and SNR levels, where MGDCC was reported to outperform the other feature sets. Further, this study also analyses cross-database scenarios for dysarthric severity-level classification. Voice Onset Time (VOT) was analysed, and experiments were performed using MGDCC to detect dysarthric speech against normal speech. The performance of MGDCC was then compared with baseline features using precision, recall, and F1-score, and finally, the latency period was analysed for practical deployment of the system.
    This work also explores the application of phase-based features to the emotion recognition task and pop noise detection. As technological advancements progress, dependence on machines is inevitable. Therefore, to facilitate effective interaction between humans and machines, it has become crucial to develop proficient techniques for Speech Emotion Recognition (SER). The MGDCC feature set is compared against MFCC and LFCC features using a CNN classifier and the Leave One Speaker Out technique. Furthermore, because MGDCC captures information in low-frequency regions and pop noise occurs at lower frequencies, phase-based features are applied to voice liveness detection. The results are obtained from a CNN classifier using 5-fold cross-validation and are compared against the MFCC and LFCC feature sets.
    This work proposes time averaging-based features in order to understand the amount of information captured across the temporal axis, as there would not be many temporal variations in a cry signal. The research conducted in this study utilizes a 10-fold stratified cross-validation approach with machine learning classifiers, specifically Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Random Forest (RF). This work also presents the CQT-based Constant-Q Harmonic Coefficients (CQHC) and Constant-Q Pitch Coefficients (CQPC) for the classification of infant cry into normal and pathological, since an effective joint representation of the spectral and pitch components of a spectrum has not yet been achieved, leaving scope for improvement. The results are compared by considering the MFCC, LFCC, and CQCC feature sets as the baseline features using machine learning and deep learning classifiers, such as Convolutional Neural Networks (CNN), Gaussian Mixture Models (GMM), and Support Vector Machines (SVM), with 5-fold cross-validation accuracy as the metric.
  • Item (Open Access)
    Development of Countermeasures for Voice Liveness and Spoofed Speech Detection
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2022) Chodingala, Piyushkumar Kiritbhai; Patil, Hemant A.
    An Automatic Speaker Verification (ASV) or voice biometric system performs machine-based authentication of speakers using voice signals. ASV has applications such as banking transactions using mobile phones. Personal information and banking details demand more robust security of ASV systems. Furthermore, Voice Assistants (VAs) are known for the convenience of controlling most surrounding devices, such as the user's personal device, door locks, electric appliances, etc. However, these ASV and VA systems are also vulnerable to various spoofing attacks, such as details, twins, Voice Conversion (VC), Speech Synthesis (SS), and replay. In particular, the user's voice command can be conveniently recorded and played back by the imposter (attacker) with negligible cost. Hence, the most harmful attack (replay attack) of morphing the user's voice command can be performed easily. This thesis therefore aims to develop countermeasures to protect these ASV and VA systems from replay attacks. In addition, this thesis is also an attempt to develop the Voice Liveness Detection (VLD) task as a countermeasure for replay attacks.
    In this thesis, the novel Cochlear Filter Cepstral Coefficients based Instantaneous Frequency using Quadrature Energy Separation Algorithm (CFCCIF-QESA) feature set is proposed for replay Spoofed Speech Detection (SSD) on ASV systems. The performance of the proposed feature set is evaluated using publicly available datasets, such as ASVspoof 2017 v2.0 and BTAS 2016. Furthermore, the significance of the Delay and Sum (DAS) beamformer over the state-of-the-art Minimum Variance Distortionless Response (MVDR) beamformer is examined for replay SSD on VAs. Finally, wavelet-based features are proposed for the VLD task. The performance of the proposed wavelet-based approaches is evaluated using the recently released POp noise COrpus (POCO).