Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation
AUT Journal of Electrical Engineering
Articles in Press, Accepted Manuscript, Available Online from 12 Azar 1403 (2 December 2024). Full Text (857.33 K)
Article Type: Research Article
DOI: 10.22060/eej.2024.23490.5616
Authors
Rozhan Ahmadi¹; Shohreh Kasaei*²
¹ M.Sc. in Computer Engineering, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
² Professor of Artificial Intelligence, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Abstract
Recent advancements in Weakly Supervised Semantic Segmentation (WSSS) have highlighted the use of image-level class labels as a form of supervision. Many methods use pseudo-labels derived from class activation maps (CAMs) to compensate for the limited spatial information in class labels. However, CAMs generated by Convolutional Neural Networks (CNNs) tend to focus on the most prominent features, making it difficult to distinguish foreground objects from their backgrounds. While recent studies show that features from Vision Transformers (ViTs) capture scene layout more effectively than CNN features, the use of hierarchical ViTs has not been widely studied in WSSS. This work introduces "SWTformer" and explores the effect of the Swin Transformer's local-to-global view on improving the accuracy of initial seed CAMs. SWTformer-V1 produces CAMs solely from patch tokens as its input features. SWTformer-V2 enhances this process by integrating a multi-scale feature fusion mechanism and employing a background-aware mechanism that refines the localization maps, resulting in better differentiation between objects. Experiments on the Pascal VOC 2012 dataset demonstrate that, compared to state-of-the-art models, SWTformer-V1 achieves a 0.98% higher mAP in localization accuracy and generates initial localization maps with a 0.82% higher mIoU, while relying solely on the classification network. SWTformer-V2 further improves the accuracy of the seed CAMs by 5.32% mIoU. Code available at: https://github.com/RozhanAhmadi/SWTformer
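For context on the core mechanism the abstract describes, the sketch below shows one common way a classification head turns transformer patch tokens into per-class activation maps. This is a minimal illustration, not the authors' released code: the function name, the `classifier` head, and the grid sizes are hypothetical placeholders, and SWTformer's multi-scale fusion and background-aware refinement are not modeled here.

```python
import torch
import torch.nn.functional as F

def cam_from_patch_tokens(tokens, classifier, h, w):
    """Project patch tokens to class scores and reshape them into spatial maps.

    tokens:     (B, N, C) patch tokens from the last transformer stage
    classifier: linear head mapping C features to num_classes scores
    (h, w):     spatial grid of the stage, with N == h * w
    """
    B, N, C = tokens.shape
    scores = classifier(tokens)                        # (B, N, num_classes)
    cam = scores.transpose(1, 2).reshape(B, -1, h, w)  # (B, num_classes, h, w)
    cam = F.relu(cam)                                  # keep positive evidence only
    # Min-max normalize each class map to [0, 1]
    cam_min = cam.flatten(2).min(-1)[0][..., None, None]
    cam_max = cam.flatten(2).max(-1)[0][..., None, None]
    return (cam - cam_min) / (cam_max - cam_min + 1e-7)

# Usage with dummy inputs (20 foreground classes, as in Pascal VOC 2012):
tokens = torch.randn(2, 14 * 14, 768)   # hypothetical stage-4 patch tokens
head = torch.nn.Linear(768, 20)
cams = cam_from_patch_tokens(tokens, head, 14, 14)  # (2, 20, 14, 14)
```

In seed-generation pipelines of this kind, the resulting low-resolution maps are typically upsampled to image size and thresholded to produce the initial pseudo-labels.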
Keywords
Weakly Supervised Semantic Segmentation; Class Activation Map; Hierarchical Vision Transformer; Image-level label