Text-Enhanced Semantic Segmentation via Contrastive Language-Image Pretraining Guided Multi-Modal Feature Fusion with Feature Refinement Approach
Published in: International Journal of Engineering, Volume: 39, Issue: 6
Year of publication: 1405 (Solar Hijri)
Document type: Journal article
Language: English
The full text of this article is available as a 16-page PDF.
National scientific document ID:
JR_IJE-39-6_011
Indexing date: 26 Shahrivar 1404
Abstract:
Image semantic segmentation is a fundamental task in computer vision. It assigns a class label to every pixel, partitioning an image into its distinct parts. It has wide applications in scientific, medical, and industrial fields. Despite significant advancements, detailed and accurate segmentation remains challenging. In this paper, we propose a multi-modal semantic segmentation method to enrich feature maps. By taking both visual and textual features as multi-modal inputs, the model increases feature richness and achieves a more informative and detailed feature representation. In particular, ResNet-101 serves as the backbone for extracting visual features. The Contrastive Language-Image Pretraining (CLIP) text encoder extracts textual features, making the representation multi-modal. Additionally, mid-level features from ResNet-101 are used to reduce the information loss that occurs in deep layers, thereby enhancing the image reconstruction process. The proposed model was tested on the COCO dataset, achieving a mean Intersection over Union (mIoU) of 64.44% on four categories: "person," "chair," "dining table," and "background." The proposed method outperforms the baseline DeepLabV3+ model, whose mIoU is 56.57%, by 7.87 percentage points. These results underscore the potential of combining multi-modal image-text data and advanced attention mechanisms to enhance semantic segmentation performance.
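The mIoU figures quoted in the abstract can be illustrated with a minimal sketch. It assumes the common definition (per-class intersection over union, averaged over the classes present); the paper's exact evaluation protocol is not given in this listing. The function name `mean_iou`, the class-index mapping, and the toy label maps are hypothetical and serve only as illustration.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over classes that occur in the prediction or the ground truth."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")
    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes
    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1
    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # skip classes absent from both label maps
            ious.append(intersection[c] / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical flattened label maps (assumed mapping:
# 0 = background, 1 = person, 2 = chair, 3 = dining table).
target = [0, 0, 1, 1, 2, 2, 3, 3]
pred = [0, 1, 1, 1, 2, 0, 3, 3]
score = mean_iou(pred, target, num_classes=4)

# Gap between the two mIoU values reported in the abstract, in percentage points.
gain_points = round(64.44 - 56.57, 2)
```

On the toy maps the per-class IoUs are 1/3, 2/3, 1/2, and 1, so `score` is 0.625, and `gain_points` reproduces the 7.87-point gap between the two reported mIoU values.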
Keywords:
attention mechanism, Contrastive Language-Image Pretraining, DeepLabV3+, multi-modal, semantic segmentation
Authors
M. Gholami
Faculty of Computer Engineering, Shahrood University of Technology, Shahrood, Iran
M. Fateh
Faculty of Computer Engineering, Shahrood University of Technology, Shahrood, Iran
A. Fateh
School of Computer Engineering, Iran University of Science and Technology (IUST), Tehran, Iran