Multi-Speaker Noise Reduction through Audio-Visual Fusion and Disentanglement

Publication year: 1402 (2023)
Document type: Conference paper
Language: English
Views: 183

This paper is available as a 9-page PDF file.

National scientific document ID: ICPCONF09_131

Indexing date: 8 Mehr 1402 (30 September 2023)

Abstract:

Recognizing speech from multiple simultaneous speakers is critical for real-world speech applications but is hampered by overlapping acoustic interference. This paper proposes a multi-view audio-visual deep learning architecture that uses visual speech cues to reduce noise and improve speech recognition accuracy in multi-talker settings. A visual processing front-end disentangles blended mouth movements into individual speaker representations using adversarial training. These representations are fused with beamformed audio encodings through temporally synchronized co-attention. Experiments demonstrate significant noise reduction and improved recognition accuracy compared to audio-only methods, and the model generalizes well to varying speaker counts and unseen speaker combinations. The proposed audio-visual fusion framework enables robust multi-speaker speech recognition to be deployed without per-speaker training.
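
The abstract describes a fusion step in which per-speaker visual representations (obtained by adversarial disentanglement) are combined with beamformed audio encodings through temporally synchronized co-attention. The paper's actual implementation is only available in the PDF; the following is a minimal, hypothetical PyTorch sketch of what such a co-attention fusion block could look like. The class name, tensor dimensions, and the use of nn.MultiheadAttention are illustrative assumptions, not details taken from the paper.

# Illustrative sketch only (not the authors' implementation): bidirectional
# cross-attention fusion of temporally aligned audio and visual embeddings.
import torch
import torch.nn as nn


class AudioVisualCoAttentionFusion(nn.Module):
    """Fuse per-frame audio and visual embeddings via bidirectional cross-attention."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Audio queries attend over visual keys/values, and vice versa.
        self.audio_to_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, T, dim) beamformed audio frame encodings
        # visual: (batch, T, dim) per-speaker visual speech embeddings, assumed
        #         resampled to the same frame rate as the audio stream
        a_fused, _ = self.audio_to_visual(query=audio, key=visual, value=visual)
        v_fused, _ = self.visual_to_audio(query=visual, key=audio, value=audio)
        # Concatenate both attended views and project back to the model dimension.
        return self.out_proj(torch.cat([a_fused, v_fused], dim=-1))


if __name__ == "__main__":
    fusion = AudioVisualCoAttentionFusion()
    audio = torch.randn(2, 100, 256)    # 2 utterances, 100 frames each
    visual = torch.randn(2, 100, 256)
    print(fusion(audio, visual).shape)  # torch.Size([2, 100, 256])

In this sketch, temporal synchronization is assumed to have been handled upstream (both streams share the same frame rate), so co-attention reduces to frame-wise cross-attention in both directions; the fused representation would then feed a downstream recognition head.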

Authors

Faramarz Zareian

University of Genova, Computer Science (Master's degree), Genova, DIBRIS (Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi)