Deep Learning Approach for Robust Voice Activity Detection: Integrating CNN and Self-Attention with Multi-Resolution MFCC

Publication year: 1403 (Solar Hijri)
Document type: journal article
Language: English
Views: 134

This article is available for download as a 12-page PDF file.

National scientific document ID:

JR_JADM-12-3_002

Indexing date: 11 Dey 1403

Abstract:

Voice Activity Detection (VAD) plays a vital role in various audio processing applications, such as speech recognition, speech enhancement, telecommunications, satellite telephony, and noise reduction. The performance of these systems can be enhanced by an accurate VAD method. In this paper, multi-resolution Mel-Frequency Cepstral Coefficients (MRMFCCs) and their first- and second-order derivatives (delta and delta-delta) are extracted from the speech signal and fed into a deep model. The proposed model begins with convolutional layers, which are effective in capturing local features and patterns in the data. The captured features are fed into two consecutive multi-head self-attention layers. With the help of these two layers, the model can selectively focus on the most relevant features across the entire input sequence, thus reducing the influence of irrelevant noise. The combination of convolutional layers and self-attention enables the model to capture both local and global context within the speech signal. The model concludes with a dense layer for classification. To evaluate the proposed model, 15 different noise types from the NoiseX-92 corpus have been used to validate the proposed method in noisy conditions. The experimental results show that the proposed framework achieves superior performance compared to traditional VAD techniques, even in noisy environments.
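The pipeline described in the abstract (convolutional front-end, two consecutive multi-head self-attention layers, dense classifier producing per-frame speech/non-speech decisions) can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: the layer widths, kernel sizes, number of heads, and the 39-dimensional feature size (e.g. 13 MFCCs plus deltas and delta-deltas) are assumptions for demonstration only.

```python
import torch
import torch.nn as nn

class CNNSelfAttentionVAD(nn.Module):
    """Hypothetical sketch of a CNN + self-attention VAD model.

    All hyperparameters here are illustrative assumptions,
    not values taken from the paper."""

    def __init__(self, n_features=39, d_model=64, n_heads=4):
        super().__init__()
        # Convolutional layers capture local spectro-temporal patterns
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Two consecutive multi-head self-attention layers provide
        # global context across the whole input sequence
        self.attn1 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Dense layer yields a per-frame speech probability
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, x):
        # x: (batch, time, n_features)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, d_model)
        h, _ = self.attn1(h, h, h)
        h, _ = self.attn2(h, h, h)
        return torch.sigmoid(self.classifier(h)).squeeze(-1)  # (batch, time)

model = CNNSelfAttentionVAD()
frames = torch.randn(2, 100, 39)  # 2 utterances, 100 frames, 39 features each
probs = model(frames)
print(probs.shape)  # torch.Size([2, 100])
```

Each output value is a frame-level speech probability, so thresholding (e.g. at 0.5) yields the binary voice-activity decision.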

Keywords:

Voice Activity Detection, self-attention mechanism, multi-resolution Mel-Frequency Cepstral Coefficients, deep learning

Authors

Khadijeh Aghajani

Department of Computer Engineering, Faculty of Engineering and Technology, University of Mazandaran, Babolsar, Iran.
