Text-Enhanced Semantic Segmentation via Contrastive Language-Image Pretraining Guided Multi-Modal Feature Fusion with Feature Refinement Approach

M. Gholami; M. Fateh; A. Fateh

Published in: International Journal of Engineering, Volume 39, Issue 6

Publication year: 1405 (Solar Hijri)

Document type: Journal article

Language: English

The full text of this article is available as a 16-page PDF.

Permanent link to this article:

https://civilica.com/doc/2369496

National scientific document ID:

JR_IJE-39-6_011

Indexing date: 26 Shahrivar 1404 (Solar Hijri)

Abstract:

Image semantic segmentation is a fundamental task in computer vision: it assigns a class label to every pixel, partitioning an image into distinct regions. It has wide applications in scientific, medical, and industrial fields. Despite significant advances, detailed and accurate segmentation remains challenging. In this paper, we propose a multi-modal semantic segmentation method that enriches feature maps. By taking both visual and textual features as inputs, the model obtains a more informative and detailed feature representation. Specifically, ResNet-101 serves as the backbone for extracting visual features, and the Contrastive Language-Image Pretraining (CLIP) text encoder extracts textual features, making the representation multi-modal. Additionally, mid-level features from ResNet-101 are used to reduce the information loss that occurs in deep layers, thereby improving image reconstruction. The proposed model was tested on the COCO dataset, achieving a mean Intersection over Union (mIoU) of 64.44% on four categories: "person," "chair," "dining table," and "background." It outperforms the baseline DeepLabV3+ model, whose mIoU is 56.57%, by 7.87 percentage points. These results underscore the potential of combining multi-modal image-text data and advanced attention mechanisms to enhance semantic segmentation performance.
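The mIoU figures quoted in the abstract can be made concrete with a small sketch. The code below is not the authors' evaluation code. It is a minimal, dependency-free illustration of the standard definition: per-class IoU = TP / (TP + FP + FN), averaged over classes. The function names `confusion_matrix` and `mean_iou`, the class ordering in `CLASSES`, and the toy label lists are assumptions made here for illustration only.

```python
from typing import List, Sequence

# The four categories named in the abstract; the index order is an assumption.
CLASSES = ["background", "person", "chair", "dining table"]


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels: rows are ground-truth classes, columns are predictions."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_iou(cm: List[List[int]]) -> float:
    """Average per-class IoU, skipping classes absent from both labels and predictions."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # pixels of class c that were missed
        fp = sum(cm[r][c] for r in range(n)) - tp   # pixels wrongly assigned to class c
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy flattened label maps (hypothetical data, 8 pixels).
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 3, 3, 3]
toy_miou = mean_iou(confusion_matrix(y_true, y_pred, len(CLASSES)))

# Gap between the two mIoU values reported in the abstract.
proposed, baseline = 64.44, 56.57
absolute_gain = round(proposed - baseline, 2)      # percentage points
relative_gain = (proposed - baseline) / baseline   # fraction of the baseline score
```

The last three lines separate the absolute gap, which is the 7.87 reported in the abstract, from the relative gain over the baseline, which is a different quantity.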

Keywords:

attention mechanism, Contrastive Language-Image Pretraining, DeepLabV3+, multi-modal, semantic segmentation

Authors

M. Gholami

Faculty of Computer Engineering, Shahrood University of Technology, Shahrood, Iran

M. Fateh

Faculty of Computer Engineering, Shahrood University of Technology, Shahrood, Iran

A. Fateh

School of Computer Engineering, Iran University of Science and Technology (IUST), Tehran, Iran
