Multi-Speaker Noise Reduction through Audio-Visual Fusion and Disentanglement
Year of publication: 1402 (Solar Hijri)
Document type: Conference paper
Language: English
The full text of this paper is available as a 9-page PDF.
National scientific document ID:
ICPCONF09_131
Indexing date: 8 Mehr 1402 (30 September 2023)
Abstract:
Recognizing speech from multiple simultaneous speakers is critical for real-world speech applications, but accuracy is degraded by overlapping acoustic noise. This paper proposes a multi-view audio-visual deep learning architecture that uses visual speech cues to reduce noise and improve speech recognition accuracy in multi-talker settings. A visual processing front-end uses adversarial training to disentangle blended mouth movements into individual speaker representations. These representations are fused with beamformed audio encodings through temporally synchronized co-attention. Experiments demonstrate significant noise reduction and increased recognition accuracy compared to audio-only methods. The model generalizes well to varying speaker counts and unseen speaker combinations. This audio-visual fusion framework enables the deployment of robust multi-speaker speech recognition without requiring per-speaker training.
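The fusion step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no architectural details, so the function names (`attend`, `co_attention_fuse`), the scaled dot-product scoring, the `window` parameter used to model temporal synchronization, and the concatenation used as the fused output are all assumptions made for illustration. The sketch also assumes the audio and visual streams are already frame-aligned and projected to the same feature dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, window):
    """For each query frame t, average the key frames s with |t - s| <= window,
    weighted by softmax of scaled dot-product scores (hypothetical helper)."""
    dim = len(queries[0])
    contexts = []
    for t, query in enumerate(queries):
        lo = max(0, t - window)
        hi = min(len(keys), t + window + 1)
        scores = [
            sum(q * k for q, k in zip(query, keys[s])) / math.sqrt(dim)
            for s in range(lo, hi)
        ]
        weights = softmax(scores)
        context = [
            sum(w * keys[lo + i][j] for i, w in enumerate(weights))
            for j in range(len(keys[0]))
        ]
        contexts.append(context)
    return contexts


def co_attention_fuse(audio, visual, window=1):
    """Fuse two frame-aligned sequences (assumed layout: list of frames,
    each frame a list of floats). Each audio frame gathers visual context
    and each visual frame gathers audio context; the four vectors are
    concatenated per frame."""
    if len(audio) != len(visual):
        raise ValueError("audio and visual must have the same number of frames")
    visual_ctx = attend(audio, visual, window)  # audio queries attend to visual
    audio_ctx = attend(visual, audio, window)   # visual queries attend to audio
    return [
        a + vc + v + ac
        for a, vc, v, ac in zip(audio, visual_ctx, visual, audio_ctx)
    ]


# Toy data: 3 time frames, 2 features per frame (values are arbitrary).
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
fused = co_attention_fuse(audio, visual, window=1)
```

Restricting attention to a small window around the same time index is one simple way to keep the two modalities synchronized. With `window=0`, each frame attends only to its counterpart at the same time step, so the context reduces to that frame itself.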
Authors
Faramarz Zareian
University of Genova, Computer Science, Master's degree, Genova, DIBRIS (Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi)