Learning Hierarchical Embedding for Video Instance Segmentation
| Authors | |
|---|---|
| Publication date | 2021 |
| Book title | MM '21 |
| Book subtitle | Proceedings of the 29th ACM International Conference on Multimedia : October 20-24, 2021, Virtual Event, China |
| ISBN (electronic) | |
| Event | 29th ACM International Conference on Multimedia, MM 2021 |
| Pages (from-to) | 1884-1892 |
| Publisher | New York, NY: Association for Computing Machinery |
| Organisations | |
| Abstract | In this paper, we address video instance segmentation using a new generative model that learns effective representations of the target and background appearance. We propose to exploit hierarchical structural embedding over spatio-temporal space, which is compact, powerful, and flexible in contrast to current tracking-by-detection methods. Specifically, our model segments and tracks instances across space and time in a single forward pass, which is formulated as hierarchical embedding learning. The model is trained to locate the pixels belonging to specific instances over a video clip. We first take advantage of a novel mixing function to better fuse spatio-temporal embeddings. Moreover, we introduce normalizing flows to further improve the robustness of the learned appearance embedding, which theoretically extends conventional generative flows to a factorized conditional scheme. Comprehensive experiments on the video instance segmentation benchmark, i.e., YouTube-VIS, demonstrate the effectiveness of the proposed approach. Furthermore, we evaluate our method on an unsupervised video object segmentation dataset to demonstrate its generalizability. |
| Document type | Conference contribution |
| Note | With supplemental material |
| Language | English |
| Published at | https://doi.org/10.1145/3474085.3475342 |
| Downloads | 3474085.3475342 (final published version) |
| Supplementary materials | |
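As a rough illustration of the embedding-based formulation the abstract describes (pixels of the same instance map to nearby points in embedding space, so segmentation and tracking reduce to one assignment step across frames), the following toy sketch may help. It is not the paper's model: the embedding dimension, noise level, and nearest-center assignment are all hypothetical assumptions for illustration only.

```python
import numpy as np

# Hypothetical setup (not from the paper): two instance "centers" in a
# 4-D embedding space, and simulated per-pixel embeddings for two
# consecutive frames drawn as small perturbations around each center.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0]])
frame_t  = centers + 0.05 * rng.standard_normal(centers.shape)
frame_t1 = centers + 0.05 * rng.standard_normal(centers.shape)

def assign(pixels, centers):
    """Assign each pixel embedding to its nearest instance center."""
    dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Because both frames' pixels stay close to their instance centers in
# embedding space, the same instance ids come out for both frames:
# segmentation and temporal association happen in one assignment.
ids_t = assign(frame_t, centers)
ids_t1 = assign(frame_t1, centers)
print(ids_t, ids_t1)  # → [0 1] [0 1]
```

The point of the sketch is only the single-pass idea: once embeddings are discriminative across space and time, no separate per-frame detection plus matching stage is needed, which is the contrast the abstract draws with tracking-by-detection.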