Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation
AUT Journal of Electrical Engineering
Articles in Press, Accepted Manuscript, Available Online from 12 Azar 1403 (2 December 2024). Full Text (857.33 K)
Article Type: Research Article
DOI: 10.22060/eej.2024.23490.5616
Authors
Rozhan Ahmadi¹; Shohreh Kasaei*²
¹ M.Sc. in Computer Engineering, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
² Professor of Artificial Intelligence, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Abstract
Recent advancements in Weakly Supervised Semantic Segmentation (WSSS) have highlighted the use of image-level class labels as a form of supervision. Many methods use pseudo-labels derived from class activation maps (CAMs) to compensate for the limited spatial information in class labels. However, CAMs generated by Convolutional Neural Networks (CNNs) tend to focus on the most prominent features, making it difficult to distinguish foreground objects from their backgrounds. While recent studies show that features from Vision Transformers (ViTs) capture scene layout more effectively than CNN features, the use of hierarchical ViTs has not been widely studied in WSSS. This work introduces "SWTformer" and explores the effect of the Swin Transformer's local-to-global view on improving the accuracy of initial seed CAMs. SWTformer-V1 produces CAMs solely from patch tokens as its input features. SWTformer-V2 enhances this process by integrating a multi-scale feature fusion mechanism and employing a background-aware mechanism that refines the localization maps, resulting in better differentiation between objects. Experiments on the Pascal VOC 2012 dataset demonstrate that, compared to state-of-the-art models, SWTformer-V1 achieves a 0.98% higher mAP in localization accuracy and generates initial localization maps with a 0.82% higher mIoU, while relying solely on the classification network. SWTformer-V2 further improves the accuracy of the seed CAMs by 5.32% mIoU. Code available at: https://github.com/RozhanAhmadi/SWTformer
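For context on the core mechanism the abstract describes, the sketch below shows one common way a classification head turns transformer patch tokens into per-class activation maps. This is a minimal illustration, not the authors' released code: the function name, the `classifier` head, and the grid sizes are hypothetical placeholders, and SWTformer's multi-scale fusion and background-aware refinement are not modeled here.

```python
import torch
import torch.nn.functional as F

def cam_from_patch_tokens(tokens, classifier, h, w):
    """Project patch tokens to class scores and reshape them into spatial maps.

    tokens:     (B, N, C) patch tokens from the last transformer stage
    classifier: linear head mapping C features to num_classes scores
    (h, w):     spatial grid of the stage, with N == h * w
    """
    B, N, C = tokens.shape
    scores = classifier(tokens)                        # (B, N, num_classes)
    cam = scores.transpose(1, 2).reshape(B, -1, h, w)  # (B, num_classes, h, w)
    cam = F.relu(cam)                                  # keep positive evidence only
    # Min-max normalize each class map to [0, 1]
    cam_min = cam.flatten(2).min(-1)[0][..., None, None]
    cam_max = cam.flatten(2).max(-1)[0][..., None, None]
    return (cam - cam_min) / (cam_max - cam_min + 1e-7)

# Usage with dummy inputs (20 foreground classes, as in Pascal VOC 2012):
tokens = torch.randn(2, 14 * 14, 768)   # hypothetical stage-4 patch tokens
head = torch.nn.Linear(768, 20)
cams = cam_from_patch_tokens(tokens, head, 14, 14)  # (2, 20, 14, 14)
```

In seed-generation pipelines of this kind, the resulting low-resolution maps are typically upsampled to image size and thresholded to produce the initial pseudo-labels.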
Keywords
Weakly Supervised Semantic Segmentation; Class Activation Map; Hierarchical Vision Transformer; Image-level label