Dynamic Transformer for Few-shot Instance Segmentation

Authors
Publication date 2022
Book title MM '22
Book subtitle Proceedings of the 30th ACM International Conference on Multimedia: October 10-14, 2022, Lisboa, Portugal
ISBN (electronic)
  • 9781450392037
Event 30th ACM International Conference on Multimedia
Pages (from-to) 2969–2977
Publisher New York, NY: The Association for Computing Machinery
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Few-shot instance segmentation aims to train an instance segmentation model that can quickly adapt to novel classes given only a few reference images. Existing methods are usually derived from standard detection models and tackle few-shot instance segmentation indirectly: they perform classification, box regression, and mask prediction on a large set of redundant proposals, followed by indispensable post-processing such as Non-Maximum Suppression. These complicated hand-crafted procedures and hyperparameters lead to degraded optimization and insufficient generalization ability. In this work, we propose an end-to-end Dynamic Transformer Network (DTN for short) that directly segments all target object instances of the arbitrary categories specified by reference images, removing the need for dense proposal generation and post-processing. Specifically, a small set of Dynamic Queries, conditioned on the reference images, are exclusively assigned to target object instances and generate all instance segmentation masks of the reference categories simultaneously. Moreover, a Semantic-induced Transformer Decoder is introduced to constrain the cross-attention between the dynamic queries and the target image to the pixels of the reference category, which suppresses noisy interaction with the background and irrelevant categories. Extensive experiments are conducted on the COCO-20 dataset, and the results demonstrate that our proposed Dynamic Transformer Network significantly outperforms the state of the art.
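
To make the semantic-induced cross-attention idea from the abstract concrete, the following is a minimal, illustrative sketch, not the paper's implementation: cross-attention scores between a few dynamic queries and flattened target-image features are masked by a binary per-pixel map of the reference category, so that attention to background and irrelevant-category pixels is suppressed. The function name, tensor shapes, and the assumption that such a per-pixel mask is available are all hypothetical choices made for illustration.

# Minimal sketch (illustrative assumptions throughout; not the authors' code):
# dynamic queries attend to target-image features, with attention restricted
# to pixels marked as belonging to the reference category.
import torch


def semantic_masked_cross_attention(queries, image_feats, semantic_mask, w_q, w_k, w_v):
    """queries:       (B, Nq, C)  dynamic queries conditioned on reference images
       image_feats:   (B, HW, C)  flattened target-image features
       semantic_mask: (B, HW)     1 where a pixel is (assumed) to belong to the
                                  reference category, 0 elsewhere
       w_q, w_k, w_v: (C, C)      projection matrices
    """
    q = queries @ w_q                                        # (B, Nq, C)
    k = image_feats @ w_k                                     # (B, HW, C)
    v = image_feats @ w_v                                     # (B, HW, C)

    scores = q @ k.transpose(1, 2) / (q.shape[-1] ** 0.5)     # (B, Nq, HW)
    # Suppress interaction with background / irrelevant-category pixels by
    # pushing their scores to a large negative value before the softmax.
    scores = scores.masked_fill(semantic_mask.unsqueeze(1) == 0, -1e4)
    attn = torch.softmax(scores, dim=-1)                      # (B, Nq, HW)
    return attn @ v                                           # (B, Nq, C) updated queries


if __name__ == "__main__":
    B, Nq, HW, C = 2, 5, 64, 32
    w_q, w_k, w_v = (torch.randn(C, C) * 0.02 for _ in range(3))
    out = semantic_masked_cross_attention(
        torch.randn(B, Nq, C), torch.randn(B, HW, C),
        (torch.rand(B, HW) > 0.5).long(), w_q, w_k, w_v)
    print(out.shape)  # torch.Size([2, 5, 32])

In a full decoder such a masked cross-attention step would be stacked with self-attention and feed-forward layers, and the query outputs would be decoded into per-instance masks; those stages are omitted here.
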
Document type Conference contribution
Note With supplementary video
Language English
Published at https://doi.org/10.1145/3503161.3548227