Learning Content-Enhanced Mask Transformer for Domain Generalized Urban-Scene Segmentation

Open Access
Authors
Publication date 2024
Host editors
  • M. Wooldridge
  • J. Dy
  • S. Natarajan
Book title Proceedings of the 38th AAAI Conference on Artificial Intelligence
Book subtitle AAAI-2024
ISBN
  • 9781577358879
Event 38th AAAI Conference on Artificial Intelligence
Volume | Issue number 38 | 2
Pages (from-to) 819-827
Publisher Washington, DC: AAAI Press
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Domain-generalized urban-scene semantic segmentation (USSS) aims to learn generalized semantic predictions across diverse urban-scene styles. Unlike generic domain gap challenges, USSS is unique in that the semantic categories are often similar in different urban scenes, while the styles can vary significantly due to changes in urban landscapes, weather conditions, lighting, and other factors.
Existing approaches typically rely on convolutional neural networks (CNNs) to learn the content of urban scenes.
In this paper, we propose a Content-enhanced Mask TransFormer (CMFormer) for domain-generalized USSS. The core idea is to sharpen the focus of mask attention, the fundamental component of Transformer segmentation models, on content information.
Through empirical analysis, we observe that a mask representation effectively captures pixel segments, albeit with reduced robustness to style variations. Conversely, its lower-resolution counterpart accommodates style variations better, while being less proficient at representing pixel segments. To combine the complementary strengths of the two, we introduce a novel content-enhanced mask attention mechanism. It learns mask queries from both the image feature and its down-sampled counterpart, aiming to simultaneously encapsulate the content and absorb stylistic variations. These features are fused in a Transformer decoder and integrated into a multi-resolution content-enhanced mask attention learning scheme.
Extensive experiments conducted on various domain-generalized urban-scene segmentation datasets demonstrate that the proposed CMFormer significantly outperforms existing CNN-based methods by up to 14.0% mIoU and the contemporary HGFormer by up to 1.7% mIoU. The source code is publicly available at https://github.com/BiQiWHU/CMFormer.
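The multi-resolution query learning described in the abstract can be sketched in a few lines. The toy NumPy snippet below is an illustrative simplification, not the authors' implementation: the actual mask attention (which, as in Mask2Former-style decoders, restricts attention to predicted mask regions) is reduced here to plain cross-attention, and the fusion is a simple average of the query updates from the full-resolution feature and its down-sampled counterpart. All function names and the averaging choice are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def downsample(feat, factor=2):
    # Average-pool a (C, H, W) feature map by `factor` along H and W.
    C, H, W = feat.shape
    return feat.reshape(C, H // factor, factor, W // factor, factor).mean(axis=(2, 4))

def cross_attention(queries, feat):
    # queries: (N, C); feat: (C, H, W). Plain cross-attention stands in
    # for mask attention in this sketch (no mask restriction applied).
    C = feat.shape[0]
    keys = feat.reshape(C, -1).T                             # (H*W, C)
    attn = softmax(queries @ keys.T / np.sqrt(C), axis=-1)   # (N, H*W)
    return attn @ keys                                       # (N, C)

def content_enhanced_queries(queries, feat, factor=2):
    # Hypothetical fusion: update queries against both the image feature
    # and its down-sampled counterpart, then average the two updates.
    q_hi = cross_attention(queries, feat)
    q_lo = cross_attention(queries, downsample(feat, factor))
    return 0.5 * (q_hi + q_lo)

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 8, 8)).astype(np.float32)   # toy (C, H, W) feature
queries = rng.standard_normal((4, 16)).astype(np.float32)   # 4 mask queries
out = content_enhanced_queries(queries, feat)
print(out.shape)  # (4, 16)
```

In the paper this fusion is applied at multiple decoder resolutions; the sketch shows a single level only.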
Document type Conference contribution
Language English
Published at https://doi.org/10.1609/aaai.v38i2.27840
Downloads
3218 (Accepted author manuscript)