
Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation
AUT Journal of Electrical Engineering
Volume 57, Issue 2 (Special Issue), 2025, Pages 333-342. Full-text PDF (905.54 K)
Article type: Research Article
DOI: 10.22060/eej.2024.23490.5616
Authors
Rozhan Ahmadi; Shohreh Kasaei*
Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Abstract
Recent advancements in Weakly Supervised Semantic Segmentation have highlighted the use of image-level class labels as a form of supervision. Many methods generate pseudo-labels from Class Activation Maps to compensate for the limited spatial information in class labels. However, Class Activation Maps produced by Convolutional Neural Networks tend to concentrate on the most prominent features, making it difficult to distinguish foreground objects from their backgrounds. While recent studies show that features from Vision Transformers capture scene layout more effectively than Convolutional Neural Networks, the use of hierarchical Vision Transformers has not been widely studied in Weakly Supervised Semantic Segmentation. This work introduces "SWTformer" and explores the effect of the Swin Transformer's local-to-global view on improving the accuracy of initial seed Class Activation Maps. SWTformer-V1 produces Class Activation Maps solely from patch tokens as its input features. SWTformer-V2 enhances this process by integrating a multi-scale feature fusion mechanism and employing a background-aware mechanism that refines the localization maps, resulting in better differentiation between objects. Experiments on the Pascal VOC 2012 dataset demonstrate that, compared to state-of-the-art models, SWTformer-V1 achieves 0.98% higher mAP in localization accuracy and generates initial localization maps that are 0.82% higher in mIoU, while relying solely on the classification network. SWTformer-V2 improves the accuracy of the seed Class Activation Maps by 5.32% mIoU. Code is available at: https://github.com/RozhanAhmadi/SWTformer
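The core idea of deriving seed Class Activation Maps from a hierarchical transformer's patch tokens can be illustrated with a short, self-contained sketch. This is not the authors' released implementation (see the GitHub repository linked above); the module name PatchTokenCAMHead, the tensor shapes, and the use of a 1x1-convolution classification head over the token grid are illustrative assumptions, shown here only to make the patch-token-to-CAM step concrete.

# Minimal illustrative sketch (PyTorch), not the SWTformer code:
# turn a (B, N, D) grid of patch tokens into per-class activation maps.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchTokenCAMHead(nn.Module):
    """Hypothetical head that maps patch tokens to class activation maps."""
    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        # A 1x1 convolution acts as a per-location classifier over token features.
        self.classifier = nn.Conv2d(embed_dim, num_classes, kernel_size=1, bias=False)

    def forward(self, tokens: torch.Tensor, hw: tuple):
        b, n, d = tokens.shape
        h, w = hw
        assert n == h * w, "token count must match the spatial grid"
        # Reshape the token sequence back into a 2-D feature map.
        feats = tokens.transpose(1, 2).reshape(b, d, h, w)
        cams = self.classifier(feats)                        # (B, C, H, W) raw class maps
        logits = F.adaptive_avg_pool2d(cams, 1).flatten(1)   # image-level class scores
        cams = F.relu(cams)
        # Normalize each map to [0, 1] so it can serve as a localization seed.
        cams = cams / (cams.flatten(2).max(dim=2).values[..., None, None] + 1e-5)
        return logits, cams

if __name__ == "__main__":
    # Stand-in for last-stage Swin patch tokens: a 7x7 grid of 768-d tokens.
    tokens = torch.randn(2, 49, 768)
    head = PatchTokenCAMHead(embed_dim=768, num_classes=20)  # 20 Pascal VOC classes
    logits, cams = head(tokens, hw=(7, 7))
    print(logits.shape, cams.shape)  # torch.Size([2, 20]) torch.Size([2, 20, 7, 7])

Per the abstract, SWTformer-V2 would additionally fuse features across scales and apply a background-aware refinement before the maps are used as segmentation seeds; those steps are not reproduced in this sketch.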
Keywords
Weakly Supervised Semantic Segmentation; Class Activation Map; Hierarchical Vision Transformer; Image-level label