Journal Article

Permanent URI for this collection: https://ir.daiict.ac.in/handle/123456789/37

Search Results

Now showing 1 - 10 of 38
  • Publication
    On Significance of Constant-Q Transform for Pop Noise Detection
    (Elsevier, 11-06-2023) Khoria, Kuldeep; Patil, Ankur T; Patil, Hemant; DA-IICT, Gandhinagar; Khoria, Kuldeep (201911014); Patil, Ankur T (201621008)
    Liveness detection has emerged as an important research issue for many biometrics, such as face, iris, and hand geometry, and significant research efforts are reported in the literature. However, less emphasis has been given to liveness detection for voice biometrics, i.e., Automatic Speaker Verification (ASV). Voice Liveness Detection (VLD) can be a potential technique to detect spoofing attacks on an ASV system. The presence of pop noise in the speech signal of a live speaker provides a discriminative acoustic cue to distinguish genuine vs. spoofed speech in the framework of VLD. Pop noise emerges as a burst at the lips, which is captured by the ASV system (since the speaker and microphone are close enough), indicating the liveness of the speaker and providing the basis of VLD. In this paper, we present a Constant-Q Transform (CQT)-based approach over the traditional Short-Time Fourier Transform (STFT)-based algorithm (baseline). With respect to Heisenberg's uncertainty principle in the signal processing framework, the CQT has variable spectro-temporal resolution, in particular, better frequency resolution in the low-frequency region and better temporal resolution in the high-frequency region, which can be effectively utilized to identify the low-frequency characteristics of pop noise. We have also compared the proposed algorithm with cepstral features, namely, Linear Frequency Cepstral Coefficients (LFCC) and Constant-Q Cepstral Coefficients (CQCC). The experiments are performed on the recently released POp noise COrpus (POCO) dataset with various statistical, discriminative, and deep learning-based classifiers, namely, Gaussian Mixture Model (GMM), Support Vector Machine (SVM), Convolutional Neural Network (CNN), Light-CNN (LCNN), and Residual Network (ResNet). A significant improvement in performance, in particular, an absolute improvement of 14.23% and 10.95% in percentage classification accuracy on the development and evaluation sets, respectively, is obtained for the proposed CQT-based algorithm with the SVM classifier over the STFT-SVM (baseline) system. A similar trend of performance improvement is observed for the GMM, CNN, LCNN, and ResNet classifiers for the proposed CQT-based algorithm vs. the traditional STFT-based algorithm. The analysis is further extended by simulating the replay mechanism (in the standard framework of the ASVSpoof-2019 PA challenge dataset) on a subset of the POCO dataset in order to observe the effect of room acoustics on the performance of the VLD system. By embedding the moderate simulated replay mechanism in the POCO dataset, we obtained a percentage classification accuracy of 97.82% on the evaluation set.
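The variable spectro-temporal resolution described in the abstract above comes from the CQT's geometric bin spacing. The following minimal sketch (not from the paper; `f_min`, `sr`, and the bin counts are illustrative values) shows how the constant quality factor makes the analysis window long for low frequencies and short for high ones:

```python
import math

def cqt_center_frequencies(f_min, bins_per_octave, n_bins):
    """Geometrically spaced centre frequencies: f_k = f_min * 2**(k / B)."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]

def constant_q(bins_per_octave):
    """Quality factor Q = f_k / bandwidth_k, identical for every bin."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)

# Window length per bin: N_k = Q * sr / f_k. Low bins get long windows
# (fine frequency resolution, useful for low-frequency pop noise); high
# bins get short windows (fine temporal resolution).
sr = 16000
freqs = cqt_center_frequencies(32.70, 12, 48)
Q = constant_q(12)
window_lengths = [Q * sr / f for f in freqs]
```

In contrast, the STFT baseline uses one fixed window length, hence uniform resolution at all frequencies.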
  • Publication
    Replay Spoof Detection Using Energy Separation Based Instantaneous Frequency Estimation From Quadrature and In-Phase Components
    (Elsevier, 01-01-2023) Gupta, Priyanka; Chodingala, Piyush; Patil, Hemant; DA-IICT, Gandhinagar; Gupta, Priyanka (201721001); Chodingala, Piyush (202015002)
    Replay attacks on speech are becoming easier to mount with the advent of high-quality recording and playback devices. This makes replay attacks a major concern for the security of Automatic Speaker Verification (ASV) systems and voice assistants. In the past, auditory transform-based as well as Instantaneous Frequency (IF)-based features have been proposed for replay spoofed speech detection (SSD). In this context, IF has been estimated either by the derivative of the analytic phase via the Hilbert transform, or by using the high temporal resolution Teager Energy Operator (TEO)-based Energy Separation Algorithm (ESA). However, the excellent temporal resolution of ESA comes at the cost of discarding relative phase information, and vice versa. To that effect, we propose novel Cochlear Filter Cepstral Coefficients-based Instantaneous Frequency using Quadrature Energy Separation Algorithm (CFCCIF-QESA) features, with excellent temporal resolution as well as relative phase information. CFCCIF-QESA is designed by exploiting the relative phase shift to estimate IF, without estimating phase explicitly from the signal. To motivate and validate the effectiveness of the proposed QESA approach for IF estimation, we employ information-theoretic measures, such as Mutual Information (MI), Kullback–Leibler (KL) divergence, and Jensen–Shannon (JS) divergence. The proposed CFCCIF-QESA feature set is extensively evaluated on the standard ASVSpoof 2017 version 2.0 dataset, where it achieves improved performance compared to the CFCCIF-ESA and CQCC feature sets on GMM, CNN, and LCNN classifiers. Furthermore, in cross-database evaluation using ASVSpoof 2017 v2.0 and VSDC, CFCCIF-QESA also performs relatively better than CFCCIF-ESA and CQCC on the GMM classifier. However, for self-classification on the ASVSpoof 2019 PA data, CFCCIF-QESA outperforms only CFCCIF-ESA, whereas on the BTAS 2016 dataset, it performs close to CFCCIF-ESA. Finally, results are presented for the case when the ASV system is not under attack.
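The TEO-based ESA route to instantaneous frequency mentioned above can be illustrated with the classical discrete energy separation algorithm (DESA-2). This is a textbook sketch of ESA, not the paper's proposed QESA; the sample index and tone frequency are illustrative:

```python
import math

def teager(s, n):
    """Teager Energy Operator at sample n: Psi[s](n) = s(n)^2 - s(n-1)*s(n+1)."""
    return s[n] ** 2 - s[n - 1] * s[n + 1]

def desa2_frequency(x, n):
    """DESA-2 instantaneous frequency (rad/sample) at sample n.

    y is the central difference of x; for a pure tone the identity
    1 - Psi[y]/(2*Psi[x]) = cos(2*Omega) recovers Omega exactly.
    """
    y = [x[k + 1] - x[k - 1] for k in range(1, len(x) - 1)]  # y[j] is centred at time j+1
    return 0.5 * math.acos(1.0 - teager(y, n - 1) / (2.0 * teager(x, n)))

# Synthetic tone at Omega = 0.3 rad/sample
x = [math.cos(0.3 * n) for n in range(200)]
omega_hat = desa2_frequency(x, 100)
```

The operator is purely local (three samples per estimate), which is the source of ESA's excellent temporal resolution that the abstract contrasts with phase-based IF estimation.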
  • Publication
    Music footprint recognition via sentiment, identity, and setting identification
    (Springer, 01-07-2022) Phatnani, Kirtana Sunil; Patil, Hemant; DA-IICT, Gandhinagar
    Emotional contagion is said to occur when an origin (i.e., any sensory stimulus) emanating emotions causes the observer to feel the same emotions. In this paper, we explore the identification and quantification of emotional contagion produced by music in human beings. We survey 50 subjects, asking what type of music they listen to when they are happy, excited, sad, angry, and affectionate. In the analysis of the distribution, we observe that the emotional state of the subjects predominantly influences the choice of tempo of the musical piece. We define the footprint in three dimensions, namely, sentiment, time, and identification. We unpack each song by unraveling sentiment analysis in time, using lexicons and tenses, along with identity via the pronouns used. In this study, we wish to quantify and visualize the emotional journey of the listener through music. The results can be extended to the elicitation of emotional contagion within any story, poem, or conversation.
  • Publication
    Vulnerability Issues in Automatic Speaker Verification (ASV) Systems
    (ACM DL, 10-02-2024) Gupta, Priyanka; Guido, Rodrigo Capobianco; Patil, Hemant; DA-IICT, Gandhinagar; Gupta, Priyanka (201721001)
    Claimed identities of speakers can be verified by means of automatic speaker verification (ASV) systems, also known as voice biometric systems. Focusing on security and robustness against spoofing attacks on ASV systems, and observing that investigating the attacker's perspective can lead the way to preventing known and unknown threats to ASV systems, several countermeasures (CMs) have been proposed during the ASVspoof 2015, 2017, 2019, and 2021 challenge campaigns organized during INTERSPEECH conferences. Furthermore, there is a recent initiative to organize the ASVspoof 5 challenge with the objectives of collecting massive spoofing/deepfake attack data (phase 1), and designing a spoofing-aware ASV system using a single classifier for both ASV and CM, i.e., integrated CM-ASV solutions (phase 2). To that effect, this paper presents a survey of the diverse strategies and vulnerabilities explored to successfully attack an ASV system, such as target selection, the unavailability of global countermeasures to reduce the attacker's chance of exploring weaknesses, state-of-the-art adversarial attacks based on machine learning, and deepfake generation. This paper also covers the possibility of attacks such as hardware attacks on ASV systems. Finally, we discuss several technological challenges from the attacker's perspective, which can be exploited to come up with better defence mechanisms for the security of ASV systems.
  • Publication
    Voice Privacy using CycleGAN and Time-Scale Modification
    (Elsevier, 01-07-2022) Prajapati, Gauri P; Singh, Dipesh Kumar; Amin, Preet P; Patil, Hemant; DA-IICT, Gandhinagar; Prajapati, Gauri P (201911058); Singh, Dipesh Kumar (201911057); Amin, Preet P (201801051)
    Extensive use of Intelligent Personal Assistants (IPAs) and biometrics in our day-to-day life calls for privacy preservation while dealing with personal data. To that effect, efforts have been made to preserve the personally identifiable characteristics of the human voice using different speaker anonymization techniques. In this paper, we propose a Cycle-Consistent Generative Adversarial Network (CycleGAN) to modify (transform) the speaker's gender as well as other prosodic aspects using their Mel cepstral coefficients (MCEPs) and fundamental frequency (F0). For effective anonymization in the context of voice privacy, we propose two-level (i.e., double) anonymization, where first-level anonymization is done using CycleGAN, followed by second-level anonymization using time-scale modification. Speaker anonymization and intelligibility are measured objectively using automatic speaker verification (ASV) and automatic speech recognition (ASR) experiments, respectively, on the development and test sets of the Librispeech and VCTK datasets. For CycleGAN-based anonymization, the average % EERs (% WERs) are 40.3% (8.89%) and 40.95% (9.37%) with original enrollments and anonymized trials of the development and test datasets, respectively. The average % EERs (% WERs) for double anonymization are 46.19% (9.95%) and 44.76% (10.34%) with original enrollments and anonymized trials of the development and test datasets, respectively. For voice privacy evaluation, the performance of the ASV system is most important when both enrollments and trials are anonymized (called the A-A case), which is also briefly discussed in this work. The average % EERs for the A-A case (test set) are 24.29% and 2.81% using CycleGAN-based anonymization and double anonymization, respectively. Objective evaluation for a more advanced attack model (i.e., an attacker having anonymized data) is also explored in this study. The performance reflects the robustness of the proposed anonymization approach towards voice privacy. Subjective tests using 101 listeners, and the corresponding analysis of variance (ANOVA) and Tukey–Kramer post-hoc tests, are also carried out in order to establish the statistical significance of our results. The subjective tests show that the CycleGAN and double anonymization approaches give better naturalness, intelligibility, and speaker dissimilarity than the state-of-the-art x-vector-based baseline system.
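The % EER figures quoted in this entry are Equal Error Rates from ASV trials. As a minimal illustration (not the paper's evaluation toolkit, which would interpolate the ROC curve), the EER can be approximated from genuine and impostor scores by a threshold sweep:

```python
def eer(genuine_scores, impostor_scores):
    """Equal Error Rate: sweep a decision threshold over every observed
    score and return the operating point where the false-acceptance rate
    (FAR) and false-rejection rate (FRR) are closest, reported as their mean."""
    best_gap, best_rate = float("inf"), None
    for t in sorted(set(genuine_scores) | set(impostor_scores)):
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        if abs(far - frr) < best_gap:
            best_gap, best_rate = abs(far - frr), (far + frr) / 2.0
    return best_rate
```

A higher EER against anonymized trials means the ASV system can no longer separate speakers, which is why larger % EER values indicate stronger anonymization above.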
  • Publication
    Morse wavelet transform-based features for voice liveness detection
    (Elsevier, 01-03-2024) Gupta, Priyanka; Patil, Hemant; DA-IICT, Gandhinagar; Gupta, Priyanka (201721001)
    The need for Voice Liveness Detection (VLD) has emerged particularly for the security of Automatic Speaker Verification (ASV) systems. Existing Spoofed Speech Detection (SSD) systems rely on attack-specific approaches to detect spoofed speech. However, to safeguard ASV systems against all kinds of spoofing attacks (known as well as unknown), determining whether a speech utterance is produced live (genuine) or not is important. To that effect, in this work, we propose the detection of pop noise using the Morse wavelet for the VLD task. Pop noise is a discriminative acoustic cue that is present in live speech and absent/diminished in spoofed speech. It is captured by the microphone in the form of sudden bursts of air from a live speaker's mouth due to the close proximity of the speaker to the microphone. To validate this hypothesis, we present an analysis of pop noise energy w.r.t. distance and find that it decreases exponentially with distance. Furthermore, pop noise is known to be present in very low frequency regions. To capture pop noise effectively, we propose to exploit the excellent frequency resolution of the Continuous Wavelet Transform (CWT) using Generalized Morse Wavelets (GMWs), a superfamily of analytic wavelets. To that effect, in this work, we have analysed the suitability of GMWs for pop noise detection for the VLD task using the POp noise COrpus (POCO). The wavelet parameters are fine-tuned according to the VLD task. Furthermore, the performance of the VLD system is evaluated for various subband frequencies, and it is observed that a low-frequency subband gives the best performance accuracy of 90.55% and 88.43% on the Dev and Eval sets, respectively. In addition, phoneme-based analysis shows the dependence of the performance of the VLD system on the type of phonemes in the utterances. It is shown that phonemes such as plosives and fricatives exhibit distinct pop noise compared to other phonemes. Furthermore, an extension of the POCO dataset is used for experiments where simulated reverberation is added to spoofed signals, assuming the attacker (or the recording device) is positioned at various distances, which enables studying the effect of speaker-attacker distance. Similar to the previous results, it is observed that for the reverberated case too, the same low-frequency subband is optimal for the VLD task across all distances. Furthermore, the proposed feature set is evaluated using three classifiers, namely, Convolutional Neural Network (CNN), Light CNN (LCNN), and Residual Neural Network (ResNet), on the POCO dataset as well as the reverberated POCO dataset. It is observed that CNN gives the highest accuracy of 88.43% on the Eval set of the POCO dataset. Furthermore, the proposed features are also evaluated under two ideal scenarios: when the ASV system is strictly under attack, and when it is strictly not under attack. It is observed that the proposed Morse wavelet-based VLD system rejected 89% of the spoofed utterances and accepted 88.30% of the genuine utterances.
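Generalized Morse Wavelets are usually defined in the frequency domain, which makes their analyticity and tunability easy to see. The sketch below is a standard unnormalized form with illustrative shape parameters (the paper's tuned values are not reproduced here):

```python
import math

def morse_magnitude(omega, beta=3.0, gamma=3.0):
    """Frequency response of an (unnormalized) Generalized Morse Wavelet:
    |Psi(omega)| = omega**beta * exp(-omega**gamma) for omega > 0, and 0
    otherwise -- analytic, so negative frequencies are fully suppressed."""
    return omega ** beta * math.exp(-(omega ** gamma)) if omega > 0 else 0.0

def morse_peak_frequency(beta, gamma):
    """Closed-form peak of the magnitude response: (beta/gamma)**(1/gamma).
    Tuning beta and gamma moves and reshapes this peak, which is how the
    wavelet can be matched to a low-frequency cue such as pop noise."""
    return (beta / gamma) ** (1.0 / gamma)
```

Because the whole superfamily is parameterized by just (beta, gamma), "fine-tuning the wavelet parameters to the VLD task" amounts to choosing this pair.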
  • Publication
    Voice privacy using time-scale and pitch modification
    ( S N Computer Science, 27-01-2024) Singh, Dipesh Kumar; Prajapati, Gauri; Patil, Hemant; DA-IICT, Gandhinagar; Singh, Dipesh Kumar (201911057); Prajapati, Gauri (201911058)
    There is a growing demand for digitization of various day-to-day work and hence, a surge in the use of Intelligent Personal Assistants. The extensive use of these smart digital assistants calls for security and privacy preservation techniques, because they use personally identifiable characteristics of the user. To that effect, various privacy preservation techniques for different types of voice assistants have been explored. Thus, in this study, we explore prosody modification methods to modify the speaker-specific characteristics of the user, so that the modified utterances can be made publicly available for training different speech-based systems. This study presents three data augmentation techniques as voice anonymization methods to modify the speaker-dependent speech parameters. Voice anonymization and speech intelligibility are measured objectively using automatic speaker verification (ASV) and automatic speech recognition (ASR) experiments, respectively, on the development and test sets of the Librispeech dataset. For speed perturbation-based anonymization, up to a 53.7% relative increase in % EER is observed at a particular perturbation factor for both male and female speakers. For the same case, the % WER was adequate (less than the baseline system), supporting the use of the speed perturbation method as the anonymization algorithm in a voice privacy system. Similar performance is observed for pitch perturbation at a particular perturbation factor. However, tempo perturbation was not found to be useful for speaker anonymization in our experiments, with % EER on the order of 5–10%.
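Speed perturbation, the most effective of the three techniques in this entry, amounts to resampling the waveform by a constant factor. A minimal pure-Python sketch (linear interpolation; production systems would use a proper resampler, and the perturbation factor here is illustrative):

```python
def speed_perturb(signal, factor):
    """Resample a waveform by `factor` (>1 plays faster and raises pitch)
    using linear interpolation. Because resampling shifts both F0 and the
    formant positions together, it masks speaker-specific characteristics."""
    n_out = int(len(signal) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor          # fractional read position in the input
        j = int(pos)
        j2 = min(j + 1, len(signal) - 1)
        frac = pos - j
        out.append((1.0 - frac) * signal[j] + frac * signal[j2])
    return out
```

Tempo perturbation, by contrast, changes duration while preserving pitch (e.g., via time-scale modification), which leaves F0 intact and is consistent with its weaker anonymization reported above.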
  • Publication
    Modeling musical expectancy via reinforcement learning and directed graphs
    (Springer, 06-09-2023) Phatnani, Kirtana Sunil; Patil, Hemant; DA-IICT, Gandhinagar
    Algorithms strive to capture the intricacies of our complex world, but translating qualitative aspects into quantifiable data poses a significant challenge. In our paper, we embark on a journey to unveil the hidden structure of music by exploring the interplay between our predictions and the sequence of musical events. Our ultimate goal is to gain insights into how certainty fluctuates throughout a musical piece using a three-fold approach: a listening test, reinforcement learning (RL), and graph construction. Through this approach, we seek to understand how musical expectancy affects physiological measurements, visualize the graphical structure of a composition, and analyze prediction accuracy across 15 musical pieces. We conducted a listening test using western classical music on 50 subjects, monitoring changes in blood pressure, heart rate, and oxygen saturation in response to different segments of the music. We also assessed the accuracy of the RL agent in predicting notes and pitches individually and simultaneously. Our findings reveal that the average accuracy of the RL agent in note and pitch prediction is 64.17% and 22.48%, respectively, while the accuracy for simultaneous prediction is 73.84%. These results give us a glimpse into the minimum level of certainty present across any composition. To further analyze the accuracy of the RL agent, we propose novel directed graphs in our paper. Our analysis shows that the variance of the edge distributions in the graph is inversely proportional to the accuracy of the RL agent. Through this comprehensive study, we hope to shed light on the enigmatic nature of music and pave the way for future research in this fascinating field.
  • Publication
    Multiple voice disorders in the same individual: Investigating handcrafted features, multi-label classification algorithms, and base-learners
    (Elsevier, 01-07-2023) Junior, Sylvio Barbon; Guido, Rodrigo Capobianco; Aguiar, Gabriel Jonas; Santana, Everton José; Junior, Mario Lemes Proença; Patil, Hemant; DA-IICT, Gandhinagar
    At the heart of this platform is a database archiving the performance and execution environment related data of standard parallel algorithm implementations run on different computing architectures using different programming environments. The online plotting and analysis tools of our platform can be combined seamlessly with the database to aid self-learning, teaching, evaluation and discussion of different HPC related topics, with a particular focus on a holistic system's perspective. The user can quantitatively compare and understand the importance of numerous deterministic as well as non-deterministic factors of both the software and the hardware that impact the performance of parallel programs. Instructors of HPC/PDC related courses can use the platform's tools to illustrate the importance of proper data collection and analysis in understanding factors impacting performance as well as to encourage peer learning among students. Scripts are provided for automatically collecting performance related data, which can then be analyzed using the platform's tools. The platform also allows students to prepare a standard lab/project report aiding the instructor in uniform evaluation. The platform's modular design enables easy inclusion of performance related data from contributors as well as addition of new features in the future.
  • Publication
    CQT-Based Cepstral Features for Classification of Normal vs. Pathological Infant Cry
    (IEEE, 27-10-2023) Patil, Hemant; Kachhi, Aastha; Patil, Ankur T; DA-IICT, Gandhinagar; Patil, Ankur T (201621008); Kachhi, Aastha
    Infant cry classification is an important area of research that involves distinguishing between normal and pathological cries. Traditional feature sets, such as the Short-Time Fourier Transform (STFT) and Mel Frequency Cepstral Coefficients (MFCC), have shown limitations due to poor spectral resolution caused by quasi-periodic sampling in high pitch-source harmonics. To address this, we propose to use Constant-Q Cepstral Coefficients (CQCC), which leverage geometrically-spaced frequency bins for an improved representation of the fundamental frequency (F0) and its harmonics for infant cry classification. Two datasets, Baby Chilanto and In-House DA-IICT, were employed to evaluate the proposed feature set. We compared the CQCC against state-of-the-art feature sets, such as MFCC and Linear Frequency Cepstral Coefficients (LFCC), using Gaussian Mixture Model (GMM) and Support Vector Machine (SVM) classifiers, with 10-fold cross-validation. The CQCC-GMM architecture achieved relatively better accuracies of 99.8% on the Baby Chilanto dataset and 98.24% on the In-House DA-IICT dataset. This work demonstrates the effectiveness of CQCC's form-invariance over traditional STFT-based spectrograms. Additionally, it explores parameter tuning and the impact of feature vector dimensions. The study presents cross-database and combined dataset scenarios, yielding an overall performance improvement of 1.59%. CQCC's robustness was also evaluated under various signal degradation conditions, including additive babble noise at different Signal-to-Noise Ratios (SNR). The performance was further compared with other feature sets using statistical measures, including F1-score, J-statistics, and latency analysis for practical deployment. Lastly, CQCC's results were compared with existing studies on the Baby Chilanto dataset.
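The final stage of CQCC extraction, after the constant-Q spectrum is converted to a log power spectrum and uniformly resampled, is a discrete cosine transform that decorrelates the spectrum into cepstral coefficients. A minimal sketch of that last step (illustrative only; real pipelines use an optimized DCT such as SciPy's):

```python
import math

def dct_ii(log_spectrum, n_coeffs):
    """Type-II DCT: projects a (uniformly resampled) log power spectrum onto
    cosine bases, yielding the first n_coeffs cepstral coefficients."""
    N = len(log_spectrum)
    return [
        sum(log_spectrum[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
            for n in range(N))
        for k in range(n_coeffs)
    ]
```

Keeping only the first few coefficients retains the smooth spectral envelope, which is what GMM and SVM classifiers in this entry are trained on.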