Theses and Dissertations

Permanent URI for this collection: http://ir.daiict.ac.in/handle/123456789/1

  • Item (Open Access)
    Classification of Pathological Infant Cries and Dysarthric Severity-Level
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2022) Kachhi, Aastha Bidhenbhai; Patil, Hemant A.; Sailor, Hardik B.
    Vocal communication is the most important means for any individual to convey their needs. From the first cry of a neonate to mature adult speech, vocalization requires proper brain coordination. Any lack of coordination between the brain and the speech production system leads to pathology. Asphyxia, asthma, Sudden Infant Death Syndrome (SIDS), and deafness are some of the infant cry pathologies, while neuromotor speech disorders, such as Dysarthria, Parkinson's Disease, and Cerebral Palsy, are some of the adult speech-related pathologies. These pathologies damage or paralyse articulatory movements in speech production, rendering words unintelligible. Infants as well as adults suffering from any of these pathologies face difficulties in conveying their emotions. Infant cry classification and analysis is a highly non-invasive method for identifying the reason behind the crying. The present work in this thesis is directed towards analysing and classifying normal vs. pathological cries using signal processing approaches. Various signal processing methods, such as the Constant-Q Transform (CQT), Heisenberg's Uncertainty Principle (U-Vector), and the Teager Energy Operator (TEO), are analysed in this thesis. Spectrographic analysis using ten different cry modes in a cry signal is also presented. In addition, an attempt has been made to analyse various pathologies using the form-invariance property of the CQT. Beyond infant cry analysis, classification of normal vs. pathological cries using 10-fold cross-validation on Gaussian Mixture Model (GMM) and Support Vector Machine (SVM) classifiers has been adopted. In recent years, dysarthria has also become a major challenge for speech technology systems, such as Automatic Speech Recognition (ASR). Dysarthric severity-level classification has gained immense attention from researchers in recent years.
Dysarthric severity-level classification aids in tracking the advancement of the disease and its treatment. In this thesis, dysarthric speech has been analysed using various signal processing operators, such as the TEO and the Linear Energy Operator (LEO), for four different dysarthric severity-levels against normal speech. With the increasing use of artificial intelligence, there has been a significant increase in the use of deep learning methods for pattern classification tasks. To that effect, for the severity-level classification of dysarthric speech, deep learning techniques, such as Convolutional Neural Network (CNN), Light CNN (LCNN), and Residual Neural Network (ResNet), have been adopted. Finally, the performance of the various signal processing-based features has been measured using various evaluation methods, such as the F1-score, J-statistic, Matthews Correlation Coefficient (MCC), Jaccard's Index, Hamming Loss, Linear Discriminant Analysis (LDA), and the latency period, for better practical deployment of the system.
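The Teager Energy Operator used throughout both of the abstracts above has a simple discrete form. The following is an illustrative sketch of that operator and its behaviour on a pure tone, not code from the thesis itself:

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager Energy Operator: psi[x](n) = x(n)^2 - x(n-1)*x(n+1).

    The output has two fewer samples than the input, since the operator
    needs one past and one future sample at each point.
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone A*cos(w*n + phi), the operator returns the constant
# A^2 * sin^2(w): it jointly encodes the amplitude and frequency of the
# resonance, which is why TEO profiles serve as "running energy" estimates.
n = np.arange(1000)
tone = 0.5 * np.cos(0.1 * n + 0.3)
psi = teager_energy(tone)
```

On real cry or dysarthric speech the output is no longer constant; its frame-wise statistics are what feature extractors such as TECC build on.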
  • Item (Open Access)
    Design of spoof speech detection system: Teager energy-based approach
    (Dhirubhai Ambani Institute of Information and Communication Technology, 2021) Kamble, Madhu R.; Patil, Hemant A.
    Automatic Speaker Verification (ASV) systems are vulnerable to various spoofing attacks, namely, Speech Synthesis (SS), Voice Conversion (VC), Replay, and Impersonation. The study of spoofing countermeasures has become increasingly important and is currently a critical area of research, which is the principal objective of this thesis. With the development of neural network-based techniques, in particular for machine-generated spoof speech signals, the Spoof Speech Detection (SSD) task becomes even more challenging. To encourage the development of countermeasures based on signal processing techniques or neural network-based features for the SSD task, standardized datasets were provided by the organizers of the ASVspoof challenge campaigns during 2015, 2017, and 2019. The front-end features extracted from the speech signal have a huge impact in the field of signal processing applications. The goal of feature extraction is to estimate meaningful information directly from the speech signal that can be helpful to the pattern classifier in speech, speaker, and emotion recognition, etc. Among the various spoofing attacks, speech synthesis, voice conversion, and replay have been identified as the most effective and accessible forms of spoofing. Accordingly, this thesis investigates and develops a framework to extract discriminative features to detect these three spoofing attacks. The main contribution of the thesis is to propose various feature sets as front-end countermeasures for the SSD task using a traditional Gaussian Mixture Model (GMM)-based classification system.
The feature sets are based on the Teager Energy Operator (TEO) and the Energy Separation Algorithm (ESA), namely, Teager Energy Cepstral Coefficients (TECC), Energy Separation Algorithm Instantaneous Frequency Cepstral Coefficients (ESA-IFCC), Energy Separation Algorithm Instantaneous Amplitude Cepstral Coefficients (ESA-IACC), Amplitude-Weighted Frequency Cepstral Coefficients (AWFCC), and the Gabor Teager Filterbank (GTFB). The motivation behind using the TEO is its ability to capture the nonlinear nature of speech production. The TEO is known to estimate the true total source energy, and it preserves the amplitude and frequency modulations of a resonant signal; hence, it improves the time-frequency resolution along with the representation of formant information. In addition, the TEO has a noise suppression property and attempts to remove the distortion caused by the noise signal. In Chapter 3, we analyze the replay speech signal in terms of the reverberation that occurs during recording of the speech signal. Reverberation introduces delays and amplitude changes, producing close copies of the speech signal, which significantly influence the replay components. To that effect, we propose to exploit the capability of the TEO to obtain a running estimate of subband energies for replay vs. genuine signals. We have used a linearly-spaced Gabor filterbank to obtain narrowband filtered signals. The TEO has the property of tracking the instantaneous changes of a signal. In Chapter 4, we propose Instantaneous Amplitude (IA) and Instantaneous Frequency (IF) features using the Energy Separation Algorithm (ESA). The speech signal is passed through bandpass filters to obtain narrowband components, because speech is a combination of several monocomponent signals. To obtain narrowband filtered signals, we have used linearly-spaced Butterworth and Gabor filterbanks. The instantaneous modulations help to understand the local characteristics of a non-stationary signal.
These IA and IF components are able to capture the information present in a slowly-varying amplitude envelope and a fast-varying frequency. For replay speech, the slowly-varying temporal modulations have a distorted amplitude envelope, and the fast-varying temporal modulations do not preserve the harmonic structure, compared to the natural speech signal. For the replay speech signal, the intermediate device characteristics and the acoustic environment distort the spectral energy compared to the natural speech energy. In Chapter 5, we extend our earlier work with the generalized TEO, i.e., by varying the samples of past and future instants with an arbitrary integer constant k, also known as the lag parameter or dependency index, named the Variable-length Teager Energy Operator (VTEO). In Chapter 6, we propose the combination of Amplitude Modulation and Frequency Modulation (AM-FM) features for the replay Spoof Speech Detection (SSD) task. The AM components are known to be affected by noise (in this case, due to the replay mechanism). In particular, we relate this damage in the AM component to the corresponding Instantaneous Frequency (IF) for the SSD task. Thus, the novelty of the proposed Amplitude-Weighted Frequency Cepstral Coefficients (AWFCC) feature set lies in using frequency components along with squared weighted amplitude components that are degraded due to replay noise. The AWFCC features contain the information of both AM and FM components together and hence provide discriminative information in the spectral characteristics. The first motivation of this thesis is to develop various countermeasures for the SSD task. The experimental results on the standard spoofing databases show that the proposed feature sets perform better than the corresponding baseline systems.
Inspired by the success in the SSD task, we applied TEO-based feature sets to a variety of speech and audio processing applications, namely, Automatic Speech Recognition (ASR), Acoustic Scene Classification (ASC), Voice Assistant (VA), and Whisper Speech Detection (WSD). In all these applications, our TEO-based feature sets gave consistently better performance compared to their respective baselines.
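The ESA-based extraction of Instantaneous Amplitude and Frequency described in this abstract can be sketched with the standard discrete DESA-1 algorithm. This is a hypothetical illustration of that textbook formulation, not the thesis's own code; in the thesis the input would first be a narrowband subband obtained from a Gabor or Butterworth filterbank:

```python
import numpy as np

def teo(x):
    # Discrete Teager Energy Operator: psi[x](n) = x(n)^2 - x(n-1)*x(n+1)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa1(x):
    """DESA-1: estimate Instantaneous Amplitude (IA) and Instantaneous
    Frequency (IF) of a narrowband signal from the Teager energies of the
    signal and of its backward difference. Returns (ia, omega) arrays,
    shorter than x because each operator consumes edge samples.
    """
    x = np.asarray(x, dtype=float)
    y = np.diff(x)                # y(n) = x(n) - x(n-1)
    psi_x = teo(x)[1:-1]          # Psi[x(n)] aligned to n = 2 .. N-3
    psi_y = teo(y)                # Psi[y(n)] aligned to n = 2 .. N-2
    # (Psi[y(n)] + Psi[y(n+1)]) / (4 * Psi[x(n)]) equals 1 - cos(omega)
    # for a locally sinusoidal signal.
    avg = (psi_y[:-1] + psi_y[1:]) / (4.0 * psi_x)
    omega = np.arccos(1.0 - avg)                       # radians/sample
    ia = np.sqrt(psi_x / (1.0 - (1.0 - avg) ** 2))     # since Psi[x] = a^2 sin^2(omega)
    return ia, omega

# On a clean tone the estimates recover the true amplitude and frequency.
x = 0.7 * np.cos(0.3 * np.arange(200) + 1.0)
ia, omega = desa1(x)
```

Frame-wise averages of these IA and IF tracks per subband are the kind of quantities that cepstral feature sets such as ESA-IACC and ESA-IFCC are built from.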
  • Item (Open Access)
    Deep Learning for Severity Level-based Classification of Dysarthria
    (2021) Gupta, Siddhant; Patil, Hemant A.
    Dysarthria is a motor speech disorder in which the muscles required for speech get damaged or paralyzed, adversely affecting the articulatory elements of speech and rendering the output voice unintelligible. Dysarthria is considered to be one of the most common forms of speech disorder. It occurs as a result of several neurological and neuro-degenerative diseases, such as Parkinson's Disease, Cerebral Palsy, etc. People suffering from dysarthria face difficulties in conveying vocal messages and emotions, which in many cases leads to depression and social isolation. Dysarthria has become a major speech technology issue because systems that work efficiently for normal speech, such as Automatic Speech Recognition systems, do not provide satisfactory results for corresponding dysarthric speech. In addition, people suffering from dysarthria are generally limited in their motor functions; therefore, the development of voice-assisted systems for them becomes all the more crucial. Furthermore, analysis and classification of dysarthric speech can be useful in tracking the progression of the disease and its treatment in a patient. In this thesis, dysarthria has been studied as a speech technology problem, classifying dysarthric speech into four severity-levels. Since people with dysarthria face problems with long speech utterances, short-duration speech segments (at most 1 s) have been used for the task, to explore the practical applicability of the thesis work. In addition, analysis of dysarthric speech has been done using different methods, such as time-domain waveforms, the Linear Prediction profile, the Teager Energy Operator profile, the Short-Time Fourier Transform, etc., to identify the most representative feature for the classification task. With the rise of Artificial Intelligence, deep learning techniques have been gaining significant popularity in machine classification and pattern recognition tasks.
Therefore, to keep the thesis work relevant, several machine learning and deep learning techniques, such as Gaussian Mixture Models (GMM), Convolutional Neural Network (CNN), Light Convolutional Neural Network (LCNN), and Residual Neural Network (ResNet), have been adopted. The severity level-based classification task has been evaluated on various popular measures, such as classification accuracy and F1-scores. In addition, for comparison with short-duration speech, classification has also been performed on long-duration speech (more than 1 s) data. Furthermore, to enhance the relevance of the work, experiments have been performed on the statistically meaningful and widely used Universal Access Speech (UA-Speech) corpus.
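For a four-class severity task such as the one above, the F1-score is usually macro-averaged so that under-represented severity levels weigh as much as frequent ones. The following is a generic sketch of that metric with hypothetical toy labels, not an excerpt from the thesis experiments:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute a per-class F1 treating each class as
    the positive class in turn, then average with equal weight per class."""
    classes = sorted(set(y_true))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels for three severity levels (0, 1, 2):
truth = [0, 0, 1, 1, 2, 2]
preds = [0, 0, 1, 2, 2, 2]
score = macro_f1(truth, preds)
```

Plain classification accuracy would count only the single misclassified utterance here; macro-F1 additionally reflects that the error hurts both class 1 (a missed detection) and class 2 (a false alarm).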