Vision Transformers: An effective method besides CNNs to capture the global context in medical image analysis applications
Published in: The First International Congress on Artificial Intelligence in Medical Sciences
Year of publication: 1402 (Solar Hijri)
Document type: Conference paper
Language: English
The full text of this paper has not been provided and is not available.
National scientific document ID:
AIMS01_177
Indexing date: 1 Mordad 1402
Abstract:
Background and aims: CNNs have been the prominent choice in many medical applications, including segmentation, diagnostic systems, and registration. The main deficiency of these models is their local convolutional operators, which degrade accuracy, especially for targets with long-range dependencies. Transformers have emerged as alternative architectures for sequence-to-sequence prediction that dispense with convolution operators and rely exclusively on attention mechanisms. Recently, much research has studied the combination of CNNs and Transformers in medical image analysis applications. Accordingly, the aim of the current research is to systematically review recent developments in the CNN-Transformer fusion approach in medical applications.

Method: A comprehensive systematic literature search was conducted in electronic databases, including PubMed and Google Scholar, restricted to English-language publications. The chosen search strategy was (“Medical image” OR “image”) AND (“Vision Transformer” OR “ViT”) AND “CNN”. This review focuses on studies in the field of medical image analysis and on studies that have the potential to be applied in medical applications. The most cited ViT papers from 2021 to 2022 with potential in medical applications were also surveyed. To make the review comprehensive, any study that combines CNNs with ViTs in semantic segmentation, recognition, or registration applications was extracted, explored, and classified in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). The publications were grouped into three categories based on their applications.

Results: We obtained 14,600 records, of which 8,040 were removed as duplicates or because of the publication-year restriction to 2021 and 2022, the period in which exponential growth was observed. The full texts of 36 papers were reviewed, and finally we selected 21 articles for our review based on criteria including fusion strategy, citation count, and potential in medical applications. The majority of studies on hybrid ViT-CNN structures fuse the information at the feature level: the output tensors of multiple scales from the CNN-based encoder are fused with the ViT output and the corresponding up-sampled tensors. Semantic segmentation was used in ten of these studies, recognition in eight, and registration in three.

Conclusion: Despite their success in medical applications, CNNs perform poorly when it comes to modeling long-range relationships and morphological variations of the target lesion. However, CNNs are better at capturing details. Therefore, fusing CNNs with ViT models has the potential to extract more diverse features, especially in medical applications. In this study, we conducted a systematic review to evaluate the methods proposed for medical image analysis and categorized the presented works based on their fusion strategy.
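The feature-level fusion described in the Results can be illustrated with a minimal NumPy sketch. It is not taken from any of the reviewed papers. The tensor shapes, the nearest-neighbour up-sampling, the channel-wise concatenation, and the names `upsample2x`, `tokens_to_map`, and `fuse` are all assumptions chosen for illustration. A real decoder would normally apply learned convolutions after each concatenation, which are omitted here.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x up-sampling of a (C, H, W) array (assumed scheme)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def tokens_to_map(tokens, h, w):
    """Reshape ViT output tokens of shape (h*w, D) into a (D, h, w) feature map."""
    n, d = tokens.shape
    assert n == h * w, "token count must match the patch grid"
    return tokens.reshape(h, w, d).transpose(2, 0, 1)

def fuse(cnn_feats, vit_tokens, grid):
    """Fuse multi-scale CNN features with the ViT output at the feature level.

    cnn_feats: list of (C_i, H_i, W_i) arrays ordered fine to coarse, each
               scale at half the resolution of the previous one (assumption).
    vit_tokens: (h*w, D) ViT output on a grid at half the coarsest CNN scale.
    """
    x = tokens_to_map(vit_tokens, *grid)
    # Walk from the coarsest CNN scale to the finest: up-sample the running
    # tensor, then concatenate the matching CNN skip features on the channel axis.
    for skip in reversed(cnn_feats):
        x = upsample2x(x)
        x = np.concatenate([x, skip], axis=0)
    return x

# Hypothetical shapes: two CNN scales and an 8x8 grid of 32-dimensional tokens.
rng = np.random.default_rng(0)
cnn_feats = [rng.standard_normal((8, 32, 32)), rng.standard_normal((16, 16, 16))]
vit_tokens = rng.standard_normal((8 * 8, 32))
fused = fuse(cnn_feats, vit_tokens, (8, 8))
print(fused.shape)  # (56, 32, 32)
```

The fused tensor keeps the finest CNN resolution while carrying the ViT channels, which is one simple way to combine local detail with global context.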
Authors
Amin Amiri Tehrani Zae
Department of Medical Informatics, School of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
Raheleh Ghouchan Nezhad Noor Nia
Department of Medical Informatics, School of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
Saeid Eslami
Department of Medical Informatics, School of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran