A Comparison of CQT Spectrogram with STFT-based Acoustic Features in Deep Learning-based Synthetic Speech Detection
Published in: Journal of AI and Data Mining, Volume 11, Issue 1
Year of publication: 1402 (Iranian calendar; 2023)
Document type: journal article
Language: English
The full text of this article is available as a 12-page PDF.
National scientific document ID: JR_JADM-11-1_010
Indexing date: 20 Farvardin 1402
Abstract:
Automatic Speaker Verification (ASV) systems have proven to be vulnerable to various types of presentation attacks, among which Logical Access attacks are manufactured using voice conversion and text-to-speech methods. In recent years, a great deal of work has concentrated on synthetic speech detection, and with the arrival of deep learning-based methods and their success across computer science, they have become the prevailing tool for this task as well. Most deep neural network-based techniques for synthetic speech detection employ acoustic features based on the Short-Term Fourier Transform (STFT), extracted from the raw audio signal. Lately, however, it has been found that using the Constant Q Transform (CQT) spectrogram can benefit both the performance and the processing time and power of deep learning-based synthetic speech detection. In this work, we compare the CQT spectrogram with some of the most widely used STFT-based acoustic features. As secondary objectives, we improve the model's performance as much as possible using methods such as self-attention and one-class learning, and we also address short-duration synthetic speech detection. Finally, we find that the CQT spectrogram-based model not only outperforms the STFT-based acoustic feature extraction methods but also reduces the processing time and resources needed to distinguish genuine speech from fake. The CQT spectrogram-based model also places well among the best works on the LA subset of the ASVspoof 2019 dataset, especially in terms of Equal Error Rate.
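The core contrast in the abstract — linearly spaced STFT bins with a fixed window versus the CQT's log-spaced bins with a constant Q factor (window length shrinking as frequency grows) — can be sketched in plain NumPy. This is an illustrative toy implementation, not the paper's pipeline; all parameter values (16 kHz sample rate, 84 CQT bins, 12 bins per octave, hop of 256, C1 ≈ 32.7 Hz minimum frequency) are assumptions chosen for the example.

```python
import numpy as np

def stft_spectrogram(y, n_fft=512, hop=256):
    """Magnitude STFT: fixed window length, linearly spaced frequency bins."""
    win = np.hanning(n_fft)
    frames = np.stack([y[i:i + n_fft] * win
                       for i in range(0, len(y) - n_fft + 1, hop)])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # shape (n_fft//2 + 1, n_frames)

def cqt_spectrogram(y, sr, fmin=32.7, bins_per_octave=12, n_bins=84, hop=256):
    """Naive constant-Q transform: log-spaced centre frequencies whose
    analysis windows shrink as frequency grows, keeping Q = f / bandwidth fixed."""
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    freqs = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)
    n_frames = 1 + (len(y) - 1) // hop
    out = np.zeros((n_bins, n_frames))
    for k, f in enumerate(freqs):
        n_win = int(np.ceil(Q * sr / f))                 # window length ∝ 1/f
        kernel = (np.hanning(n_win)
                  * np.exp(-2j * np.pi * Q * np.arange(n_win) / n_win) / n_win)
        for m in range(n_frames):
            seg = y[m * hop : m * hop + n_win]
            out[k, m] = np.abs(np.dot(seg, kernel[:len(seg)]))
    return out

# Toy 440 Hz tone standing in for a speech utterance (assumed 16 kHz rate).
sr = 16000
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)

S = stft_spectrogram(y)       # 257 linear bins
C = cqt_spectrogram(y, sr)    # 84 log-spaced bins
print(S.shape, C.shape)
```

Note the dimensionality drop (84 CQT bins versus 257 STFT bins for the same signal), which illustrates one plausible source of the reduced processing cost the abstract reports; the 440 Hz tone peaks near linear STFT bin 14 (440 / 31.25 Hz per bin) and near log CQT bin 45 (12 · log2(440 / 32.7)).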
Keywords:
Authors:
P. Abdzadeh
Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran.
H. Veisi
Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran.