Towards Open-Vocabulary Video Instance Segmentation
| Authors | |
|---|---|
| Publication date | 2023 |
| Book title | 2023 IEEE/CVF International Conference on Computer Vision |
| Book subtitle | ICCV 2023 : Paris, France, 2-6 October 2023 : proceedings |
| ISBN | |
| ISBN (electronic) | |
| Event | 2023 IEEE/CVF International Conference on Computer Vision (ICCV) |
| Pages (from-to) | 4034-4043 |
| Publisher | Los Alamitos, California: IEEE Computer Society |
| Organisations | |
| Abstract | Video Instance Segmentation (VIS) aims to segment and categorize objects in videos from a closed set of training categories, and therefore lacks the ability to generalize to novel categories in real-world videos. To address this limitation, we make three contributions. First, we introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories, including novel categories unseen during training. Second, to benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS), which contains well-annotated objects from 1,196 diverse categories, more than an order of magnitude more categories than existing datasets. Third, we propose an efficient Memory-Induced Transformer architecture, OV2Seg, the first model to achieve Open-Vocabulary VIS end to end, with near real-time inference speed. Extensive experiments on LV-VIS and four existing VIS datasets demonstrate the strong zero-shot generalization ability of OV2Seg on novel categories. The dataset and code are available at https://github.com/haochenheheda/LVVIS. |
| Document type | Conference contribution |
| Note | With supplemental file |
| Language | English |
| Published at | https://doi.org/10.48550/arXiv.2304.01715 (arXiv), https://doi.org/10.1109/ICCV51070.2023.00375 (IEEE), https://openaccess.thecvf.com/content/ICCV2023/html/Wang_Towards_Open-Vocabulary_Video_Instance_Segmentation_ICCV_2023_paper.html (CVF Open Access) |
| Other links | https://github.com/haochenheheda/LVVIS, https://www.proceedings.com/72328.html |
| Downloads | Wang_Towards_Open-Vocabulary_Video_Instance_Segmentation_ICCV_2023_paper (Accepted author manuscript) |
