Improving the Performance of Speaker Recognition System Using Optimized VGG Convolutional Neural Network and Data Augmentation

M. Sharif-Noughabi; S. M. Razavi; S. Mohamadzadeh

Published in: International Journal of Engineering, Volume 38, Issue 10

Year of publication: 1404 (Solar Hijri)

Document type: Journal article

Language: English

The full text of this paper is available as a 12-page PDF.

Permanent link to this article:

https://civilica.com/doc/2220211

National scientific document ID:

JR_IJE-38-10_017

Indexing date: 26 Farvardin 1404

Abstract:

One approach that has gained attention in recent years is to extract Mel-spectrogram images from speech signals and use them in speaker recognition systems, which makes it possible to apply existing image recognition methods to this task. In this paper, three-second segments of speech are chosen at random and the Mel-spectrogram image of each segment is computed. These images are fed into a proposed convolutional neural network that has been designed and optimized based on VGG-13. Compared with similar work, the optimized classifier has fewer parameters, trains faster, and achieves higher accuracy. On the VoxCeleb1 dataset with 1251 speakers, it achieves a top-1 accuracy of 84.25% and a top-5 accuracy of 94.33%. In addition, several data augmentation methods that operate on these images while keeping the nature of the speech intact have been applied, and in most cases they improve the system's performance. Data augmentation techniques such as horizontal flipping and time shifting of the images, or the ES technique, increased top-1 accuracy to 91.17% and top-5 accuracy to 97.32%. Moreover, using the output of the Dropout layer of the proposed network as the feature vector when training the GMM-UBM model lowers the equal error rate (EER) of the speaker verification system: these features reduce the EER from the 9% obtained with MFCC features to 3.5%.
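As a rough illustration of the random segment selection, the image-level augmentations, and the top-k scoring mentioned in the abstract, a minimal Python sketch follows. It is not the authors' implementation: the function names, the 16 kHz sample rate used in the usage note, and the choice of a circular (wrap-around) time shift are all assumptions made for illustration, and the Mel-spectrogram computation and the VGG-13-based network are left out entirely.

```python
import random


def random_segment(signal, sample_rate, seconds=3.0, rng=random):
    """Return a randomly positioned fixed-length slice of a 1-D signal."""
    length = int(seconds * sample_rate)
    if len(signal) < length:
        raise ValueError("signal is shorter than the requested segment")
    start = rng.randrange(len(signal) - length + 1)
    return signal[start:start + length]


def flip_horizontal(spec):
    """Reverse the time axis of a spectrogram stored as rows of frequency bins."""
    return [row[::-1] for row in spec]


def time_shift(spec, shift):
    """Circularly shift every frequency row by `shift` time frames.

    Wrap-around is an assumption; the paper may handle the edges differently.
    """
    n_frames = len(spec[0])
    shift %= n_frames
    if shift == 0:
        return [row[:] for row in spec]
    return [row[-shift:] + row[:-shift] for row in spec]


def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)
```

For example, random_segment(samples, 16000) would pick a 48000-sample window from a hypothetical 16 kHz recording, and top_k_accuracy(scores, labels, 5) would score a batch of per-speaker outputs in the same way as the top-5 figure reported above.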

Keywords:

speaker recognition, VGG convolutional neural network, Mel-spectrogram images, data augmentation

Authors

M. Sharif-Noughabi

Department of Electrical and Computer Engineering, University of Birjand, Birjand, Iran

S. M. Razavi

Department of Electrical and Computer Engineering, University of Birjand, Birjand, Iran

S. Mohamadzadeh

Department of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
