Journal Article

Permanent URI for this collection: https://ir.daiict.ac.in/handle/123456789/37

Search Results

Now showing 1 - 8 of 8
  • Publication
    On Significance of Constant-Q Transform for Pop Noise Detection
    (Elsevier, 11-06-2023) Khoria, Kuldeep; Patil, Ankur T; Patil, Hemant; DA-IICT, Gandhinagar; Khoria, Kuldeep (201911014); Patil, Ankur T (201621008)
    Liveness detection has emerged as an important research issue for many biometrics, such as face, iris, and hand geometry, and significant research efforts are reported in the literature. However, less emphasis has been given to liveness detection for voice biometrics, i.e., Automatic Speaker Verification (ASV). Voice Liveness Detection (VLD) can be a potential technique to detect spoofing attacks on an ASV system. The presence of pop noise in the speech signal of a live speaker provides a discriminative acoustic cue to distinguish genuine vs. spoofed speech in the framework of VLD. Pop noise emerges as a burst at the lips, which is captured by the ASV system (since the speaker and microphone are close enough), indicating the liveness of the speaker and providing the basis for VLD. In this paper, we present a Constant-Q Transform (CQT)-based approach over the traditional Short-Time Fourier Transform (STFT)-based algorithm (baseline). Under Heisenberg's uncertainty principle in the signal processing framework, the CQT has variable spectro-temporal resolution, in particular, better frequency resolution in the low-frequency region and better temporal resolution in the high-frequency region, which can be effectively utilized to identify the low-frequency characteristics of pop noise. We have also compared the proposed algorithm with cepstral features, namely, Linear Frequency Cepstral Coefficients (LFCC) and Constant-Q Cepstral Coefficients (CQCC). The experiments are performed on the recently released POp noise COrpus (POCO) dataset with various statistical, discriminative, and deep learning-based classifiers, namely, Gaussian Mixture Model (GMM), Support Vector Machine (SVM), Convolutional Neural Network (CNN), Light-CNN (LCNN), and Residual Network (ResNet).
A significant improvement in performance, in particular, an absolute improvement of 14.23% and 10.95% in percentage classification accuracy on the development and evaluation sets, respectively, is obtained for the proposed CQT-based algorithm with the SVM classifier, over the STFT-SVM (baseline) system. A similar trend of performance improvement is observed for the GMM, CNN, LCNN, and ResNet classifiers for the proposed CQT-based vs. traditional STFT-based algorithm. The analysis is further extended by simulating the replay mechanism (in the standard framework of the ASVSpoof-2019 PA challenge dataset) on a subset of the POCO dataset in order to observe the effect of room acoustics on the performance of the VLD system. By embedding the moderate simulated replay mechanism in the POCO dataset, we obtained a percentage classification accuracy of 97.82% on the evaluation set.
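The variable resolution described above follows directly from the constant-Q design: geometrically spaced center frequencies with a fixed ratio of center frequency to bandwidth, so low-frequency bins get long analysis windows (fine frequency resolution) and high-frequency bins get short ones. A minimal NumPy sketch of such a filterbank, as an illustration only (parameters like `f_min` and `bins_per_octave` are assumptions, not the paper's settings):

```python
import numpy as np

def cqt_frame(x, fs, f_min=32.7, bins_per_octave=12, n_bins=48):
    """Naive constant-Q transform of one signal frame.

    Center frequencies are geometrically spaced, so the window length
    (and hence time resolution) shrinks as frequency grows, while the
    quality factor Q stays constant -- the property exploited to
    resolve low-frequency pop noise.
    """
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    out = np.zeros(n_bins, dtype=complex)
    for k in range(n_bins):
        f_k = f_min * 2.0 ** (k / bins_per_octave)   # geometric spacing
        n_k = min(int(round(Q * fs / f_k)), len(x))  # per-bin window length
        n = np.arange(n_k)
        w = np.hanning(n_k)
        kernel = w * np.exp(-2j * np.pi * Q * n / n_k)
        out[k] = np.dot(x[:n_k], kernel) / n_k
    return out
```

Taking the magnitude of the returned vector gives a constant-Q spectrum of the frame; a practical system would use an optimized implementation such as `librosa.cqt` rather than this direct summation.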
  • Publication
    Replay Spoof Detection Using Energy Separation Based Instantaneous Frequency Estimation From Quadrature and In-Phase Components
    (Elsevier, 01-01-2023) Gupta, Priyanka; Chodingala, Piyush; Patil, Hemant; DA-IICT, Gandhinagar; Gupta, Priyanka (201721001); Chodingala, Piyush (202015002)
    Replay attacks on speech are becoming easier to mount with the advent of high-quality recording and playback devices. This makes replay attacks a major concern for the security of Automatic Speaker Verification (ASV) systems and voice assistants. In the past, auditory transform-based as well as Instantaneous Frequency (IF)-based features have been proposed for replay spoofed speech detection (SSD). In this context, IF has been estimated either by the derivative of the analytic phase via the Hilbert transform, or by using the high temporal resolution Teager Energy Operator (TEO)-based Energy Separation Algorithm (ESA). However, the excellent temporal resolution of ESA comes at the cost of not using relative phase information, and vice-versa. To that effect, we propose novel Cochlear Filter Cepstral Coefficients-based Instantaneous Frequency using Quadrature Energy Separation Algorithm (CFCCIF-QESA) features, with excellent temporal resolution as well as relative phase information. CFCCIF-QESA is designed by exploiting relative phase shift to estimate IF, without estimating phase explicitly from the signal. To motivate and validate the effectiveness of the proposed QESA approach for IF estimation, we have employed information-theoretic measures, such as Mutual Information (MI), Kullback-Leibler (KL) divergence, and Jensen-Shannon (JS) divergence. The proposed CFCCIF-QESA feature set is extensively evaluated on the standard, statistically meaningful ASVSpoof 2017 version 2.0 dataset, where it achieves improved performance as compared to the CFCCIF-ESA and CQCC feature sets with GMM, CNN, and LCNN classifiers. Furthermore, in cross-database evaluation using ASVSpoof 2017 v2.0 and VSDC, CFCCIF-QESA also performs relatively better than CFCCIF-ESA and CQCC with the GMM classifier. However, for self-classification on the ASVSpoof 2019 PA data, CFCCIF-QESA only outperforms CFCCIF-ESA, whereas on the BTAS 2016 dataset it performs close to CFCCIF-ESA. Finally, results are presented for the case when the ASV system is not under attack.
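The TEO-based Energy Separation Algorithm mentioned above estimates instantaneous frequency from three-sample neighborhoods, which is what gives it its high temporal resolution. A dependency-free sketch of the classical discrete ESA (the DESA-2 variant, shown for illustration; the paper's CFCCIF-QESA quadrature scheme is more involved):

```python
import numpy as np

def teager(x):
    """Discrete Teager Energy Operator: Psi[x](n) = x(n)^2 - x(n-1)*x(n+1)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2_frequency(x, fs):
    """Instantaneous frequency (Hz) via the TEO-based Energy Separation
    Algorithm (DESA-2): Omega = 0.5 * arccos(1 - Psi[z] / (2 * Psi[x])),
    where z is the symmetric difference of x."""
    psi_x = teager(x)
    z = x[2:] - x[:-2]            # z(n) = x(n+1) - x(n-1), centered
    psi_z = teager(z)
    psi_x = psi_x[1:-1]           # align both to samples 2 .. N-3
    ratio = np.clip(1.0 - psi_z / (2.0 * psi_x), -1.0, 1.0)
    omega = 0.5 * np.arccos(ratio)
    return omega * fs / (2.0 * np.pi)
```

On a pure sinusoid the estimate recovers the tone frequency almost exactly; on speech it tracks the per-sample frequency of a bandpass-filtered component.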
  • Publication
    Vulnerability Issues in Automatic Speaker Verification (ASV) Systems
    (ACM DL, 10-02-2024) Gupta, Priyanka; Guido, Rodrigo Capobianco; Patil, Hemant; DA-IICT, Gandhinagar; Gupta, Priyanka (201721001)
    Claimed identities of speakers can be verified by means of automatic speaker verification (ASV) systems, also known as voice biometric systems. Focusing on security and robustness against spoofing attacks on ASV systems, and observing that investigating the attacker's perspective can lead the way to preventing known and unknown threats, several countermeasures (CMs) have been proposed during the ASVspoof 2015, 2017, 2019, and 2021 challenge campaigns organized during INTERSPEECH conferences. Furthermore, there is a recent initiative to organize the ASVSpoof 5 challenge with the objectives of collecting massive spoofing/deepfake attack data (phase 1) and designing a spoofing-aware ASV system using a single classifier for both ASV and CM, i.e., integrated CM-ASV solutions (phase 2). To that effect, this paper presents a survey of the diverse strategies and vulnerabilities explored to successfully attack an ASV system, such as target selection, the unavailability of global countermeasures to reduce the attacker's chances of exploiting weaknesses, state-of-the-art adversarial attacks based on machine learning, and deepfake generation. This paper also covers the possibility of attacks such as hardware attacks on ASV systems. Finally, we discuss several technological challenges from the attacker's perspective, which can be exploited to come up with better defence mechanisms for the security of ASV systems.
  • Publication
    Morse wavelet transform-based features for voice liveness detection
    (Elsevier, 01-03-2024) Gupta, Priyanka; Patil, Hemant; DA-IICT, Gandhinagar; Gupta, Priyanka (201721001)
    The need for Voice Liveness Detection (VLD) has emerged particularly for the security of Automatic Speaker Verification (ASV) systems. Existing Spoofed Speech Detection (SSD) systems rely on attack-specific approaches to detect spoofed speech. However, to safeguard ASV systems against all kinds of spoofing attacks (known as well as unknown), determining whether speech is uttered live (genuine) or not is important. To that effect, in this work, we propose the detection of pop noise using the Morse wavelet for the VLD task. Pop noise is a discriminative acoustic cue that is present in live speech and absent/diminished in spoofed speech. It is captured by the microphone in the form of sudden bursts of air from a live speaker's mouth due to the close proximity of the speaker to the microphone. To validate this hypothesis, we present an analysis of pop noise energy w.r.t. distance and found that it decreases exponentially with distance. Furthermore, pop noise is known to be present in very low frequency regions. To capture pop noise effectively, we propose to exploit the excellent frequency resolution of the Continuous Wavelet Transform (CWT) using Generalized Morse Wavelets (GMWs), a superfamily of analytic wavelets. To that effect, in this work, we have analysed the suitability of GMWs for pop noise detection for the VLD task using the POp noise COrpus (POCO). The wavelet parameters are fine-tuned according to the VLD task. Furthermore, the performance of the VLD system is evaluated for various subband frequencies, and it is observed that the subband of 1 to � gives the best performance accuracy of 90.55% and 88.43% on the Dev and Eval sets, respectively. In addition, phoneme-based analysis shows the dependence of the performance of the VLD system on the type of phonemes in the utterances. It is shown that phonemes such as plosives and fricatives show distinct pop noise as compared to other phonemes.
Furthermore, an extension of the POCO dataset is used for experiments where simulated reverberation is added to spoofed signals, assuming the attacker (or the recording device) is positioned at various distances. This enables studying the effect of speaker-attacker distance. Similar to the previous results, it is observed that for the reverberated case too, the optimal frequency subband for the VLD task is 1 to �, across all the distances. Furthermore, the proposed feature set is evaluated using three classifiers, namely, Convolutional Neural Network (CNN), Light CNN (LCNN), and Residual Neural Network (ResNet), for the POCO dataset as well as the reverberated POCO dataset. It is observed that CNN gives the highest accuracy of 88.43% on the Eval set of the POCO dataset. Furthermore, the proposed features are also evaluated under the assumptions of two ideal scenarios: when the ASV system is strictly under attack, and when it is strictly not under attack. It is observed that the proposed Morse wavelet-based VLD system rejected 89% of the spoofed utterances and accepted 88.30% of the genuine utterances.
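GMWs are conveniently defined in the frequency domain, where the two parameters (commonly written beta and gamma) trade off time-frequency localization, and the wavelet peaks at omega = (beta/gamma)^(1/gamma). A small NumPy sketch of the analytic frequency-domain form (beta = gamma = 3 is an illustrative choice, not necessarily the tuning used in the paper):

```python
import numpy as np

def morse_wavelet_freq(omega, beta=3.0, gamma=3.0):
    """Frequency-domain Generalized Morse Wavelet:
    Psi(omega) = U(omega) * a * omega**beta * exp(-omega**gamma),
    where U is the unit step (analytic: zero for omega <= 0) and
    a = 2 * (e * gamma / beta)**(beta / gamma) normalizes the peak to 2."""
    a = 2.0 * (np.e * gamma / beta) ** (beta / gamma)
    psi = np.zeros_like(omega, dtype=float)
    m = omega > 0
    psi[m] = a * omega[m] ** beta * np.exp(-(omega[m] ** gamma))
    return psi

def morse_peak_frequency(beta=3.0, gamma=3.0):
    """Frequency at which the wavelet magnitude is maximal."""
    return (beta / gamma) ** (1.0 / gamma)
```

A CWT multiplies this window, rescaled to each center frequency, against the signal's Fourier transform and inverts; MATLAB's `cwt`, for example, uses GMWs by default.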
  • Publication
    Voice privacy using time-scale and pitch modification
    (SN Computer Science, 27-01-2024) Singh, Dipesh Kumar; Prajapati, Gauri; Patil, Hemant; DA-IICT, Gandhinagar; Singh, Dipesh Kumar (201911057); Prajapati, Gauri (201911058)
    There is a growing demand for digitization of various day-to-day work and hence a surge in the use of Intelligent Personal Assistants. The extensive use of these smart digital assistants calls for security and privacy preservation techniques because they use personally identifiable characteristics of the user. To that effect, various privacy preservation techniques for different types of voice assistants have been explored. Thus, in this study, we explored prosody modification methods to modify speaker-specific characteristics of the user, so that the modified utterances can then be made publicly available for training different speech-based systems. This study presents three data augmentation techniques as voice anonymization methods to modify the speaker-dependent speech parameters (i.e., �). Voice anonymization and speech intelligibility are measured objectively using automatic speaker verification (ASV) and automatic speech recognition (ASR) experiments, respectively, on the development and test sets of the Librispeech dataset. For speed perturbation-based anonymization, up to 53.7% relative increase in % EER is observed for a perturbation factor � for both male and female speakers. For the same case, the % WER was adequate (less than the baseline system), supporting the use of the speed perturbation method as an anonymization algorithm in a voice privacy system. Similar performance is observed for pitch perturbation with perturbation factor �. However, tempo perturbation was not found to be useful for speaker anonymization during the experiments, with % EER on the order of 5-10%.
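Of the three perturbations, speed perturbation is the simplest to picture: resampling the waveform by a factor changes duration, pitch, and formants together, which is what perturbs the speaker-specific characteristics. A dependency-free sketch (the factor used below is illustrative, not one of the paper's perturbation factors):

```python
import numpy as np

def speed_perturb(x, factor):
    """Resample x by `factor`: factor > 1 speeds up (shorter signal,
    higher pitch), factor < 1 slows down. Linear interpolation keeps
    the sketch dependency-free; a real system would band-limit first."""
    n_out = int(round(len(x) / factor))
    t_out = np.arange(n_out) * factor            # read positions in the input
    t_out = np.clip(t_out, 0, len(x) - 1)
    return np.interp(t_out, np.arange(len(x)), x)
```

For example, a factor of 1.25 shortens a 100 Hz tone to 80% of its length and raises it to 125 Hz. Tempo perturbation, by contrast, changes duration while keeping pitch fixed and requires a time-scale modification algorithm such as WSOLA.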
  • Publication
    Modeling musical expectancy via reinforcement learning and directed graphs
    (Springer, 06-09-2023) Phatnani, Kirtana Sunil; Patil, Hemant; DA-IICT, Gandhinagar
    Algorithms strive to capture the intricacies of our complex world, but translating qualitative aspects into quantifiable data poses a significant challenge. In our paper, we embark on a journey to unveil the hidden structure of music by exploring the interplay between our predictions and the sequence of musical events. Our ultimate goal is to gain insights into how certainty fluctuates throughout a musical piece using a three-fold approach: a listening test, reinforcement learning (RL), and graph construction. Through this approach, we seek to understand how musical expectancy affects physiological measurements, visualize the graphical structure of a composition, and analyze the prediction accuracy across 15 musical pieces. We conducted a listening test using western classical music on 50 subjects, monitoring changes in blood pressure, heart rate, and oxygen saturation in response to different segments of the music. We also assessed the accuracy of the RL agent in predicting notes and pitches individually and simultaneously. Our findings reveal that the average accuracy of the RL agent in note and pitch prediction is 64.17% and 22.48%, respectively, while the accuracy for simultaneous prediction is 73.84%. These results give us a glimpse into the minimum level of certainty present across any composition. To further analyze the accuracy of the RL agent, we propose novel directed graphs in our paper. Our analysis shows that the variance of the edge distributions in the graph is inversely proportional to the accuracy of the RL agent. Through this comprehensive study, we hope to shed light on the enigmatic nature of music and pave the way for future research in this fascinating field.
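The directed graphs described above link musical events to their successors. As a toy analogue (the note sequence and names below are invented for illustration, and this is not the paper's RL agent), a first-order transition graph already yields a crude expectancy model: the most frequent successor of the current note is the "expected" next event.

```python
from collections import defaultdict

def build_transition_graph(sequence):
    """Directed graph of note-to-note transitions with edge counts."""
    graph = defaultdict(lambda: defaultdict(int))
    for a, b in zip(sequence, sequence[1:]):
        graph[a][b] += 1          # edge a -> b, weighted by occurrence
    return graph

def predict_next(graph, note):
    """Most frequent successor of `note` -- a crude expectancy estimate."""
    succ = graph.get(note)
    if not succ:
        return None
    return max(succ, key=succ.get)
```

In the paper's terms, the spread of each node's outgoing edge weights indicates how predictable the continuation is, which is why the variance of the edge distributions relates inversely to prediction accuracy.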
  • Publication
    Multiple voice disorders in the same individual: Investigating handcrafted features, multi-label classification algorithms, and base-learners
    (Elsevier, 01-07-2023) Junior, Sylvio Barbon; Guido, Rodrigo Capobianco; Aguiar, Gabriel Jonas; Santana, Everton José; Junior, Mario Lemes Proença; Patil, Hemant; DA-IICT, Gandhinagar
    At the heart of this platform is a database archiving the performance and execution environment related data of standard parallel algorithm implementations run on different computing architectures using different programming environments. The online plotting and analysis tools of our platform can be combined seamlessly with the database to aid self-learning, teaching, evaluation and discussion of different HPC related topics, with a particular focus on a holistic system's perspective. The user can quantitatively compare and understand the importance of numerous deterministic as well as non-deterministic factors of both the software and the hardware that impact the performance of parallel programs. Instructors of HPC/PDC related courses can use the platform's tools to illustrate the importance of proper data collection and analysis in understanding factors impacting performance as well as to encourage peer learning among students. Scripts are provided for automatically collecting performance related data, which can then be analyzed using the platform's tools. The platform also allows students to prepare a standard lab/project report aiding the instructor in uniform evaluation. The platform's modular design enables easy inclusion of performance related data from contributors as well as addition of new features in the future.
  • Publication
    CQT-Based Cepstral Features for Classification of Normal vs. Pathological Infant Cry
    (IEEE, 27-10-2023) Patil, Hemant; Kachhi, Aastha; Patil, Ankur T; DA-IICT, Gandhinagar; Patil, Ankur T (201621008); Kachhi, Aastha
    Infant cry classification is an important area of research that involves distinguishing between normal and pathological cries. Traditional feature sets, such as the Short-Time Fourier Transform (STFT) and Mel Frequency Cepstral Coefficients (MFCC), have shown limitations due to poor spectral resolution caused by quasi-periodic sampling of high pitch-source harmonics. To address this, we propose to use Constant-Q Cepstral Coefficients (CQCC), which leverage geometrically-spaced frequency bins for an improved representation of the fundamental frequency (F0) and its harmonics for infant cry classification. Two datasets, Baby Chilanto and In-House DA-IICT, were employed to evaluate the proposed feature set. We compared CQCC against state-of-the-art feature sets, such as MFCC and Linear Frequency Cepstral Coefficients (LFCC), using Gaussian Mixture Model (GMM) and Support Vector Machine (SVM) classifiers with 10-fold cross-validation. The CQCC-GMM architecture achieved relatively better accuracies of 99.8% on the Baby Chilanto dataset and 98.24% on the In-House DA-IICT dataset. This work demonstrates the effectiveness of CQCC's form-invariance over traditional STFT-based spectrograms. Additionally, it explores parameter tuning and the impact of feature vector dimensions. The study presents cross-database and combined-dataset scenarios, yielding an overall performance improvement of 1.59%. CQCC's robustness was also evaluated under various signal degradation conditions, including additive babble noise at different Signal-to-Noise Ratios (SNR). The performance was further compared with other feature sets using statistical measures, including F1-score, J-statistics, and latency analysis for practical deployment. Lastly, CQCC's results were compared with existing studies on the Baby Chilanto dataset.
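The CQCC pipeline builds cepstral coefficients from the constant-Q spectrum by taking the log of the power spectrum and decorrelating it with a DCT. The standard formulation also uniformly resamples the geometric frequency axis before the DCT; that step is omitted in this dependency-free sketch, which just shows the log-then-DCT-II stage:

```python
import numpy as np

def cqcc_like(log_cq_power, n_coeff=13):
    """Cepstral coefficients from a log constant-Q power spectrum
    via an explicit DCT-II (unnormalized). `log_cq_power` holds one
    frame's log power across K geometrically spaced CQT bins."""
    K = len(log_cq_power)
    n = np.arange(K)
    coeffs = np.array([
        np.sum(log_cq_power * np.cos(np.pi * q * (2 * n + 1) / (2 * K)))
        for q in range(n_coeff)
    ])
    return coeffs
```

Feeding low-order coefficients like these, frame by frame, to a GMM or SVM reproduces the kind of front-end/back-end pairing evaluated in the paper.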