Human-Object Interaction Detection without Alignment Supervision

M. Kilickaya; A.W.M. Smeulders

Authors	M. Kilickaya; A.W.M. Smeulders
Publication date	2021
Book title	32nd British Machine Vision Conference 2021
Book subtitle	BMVC 2021, Online, November 22-25, 2021
Event	32nd British Machine Vision Conference
Article number	230
Number of pages	12
Publisher	BMVA Press
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	The goal of this paper is Human-Object Interaction (HO-I) detection. HO-I detection aims to find interacting human-object regions in an image and classify their interaction. Researchers have obtained significant improvements in recent years by relying on strong HO-I alignment supervision from [5]. HO-I alignment supervision pairs humans with the objects they interact with, and then aligns each human-object pair with its interaction categories. Since collecting such annotation is expensive, in this paper we propose to detect HO-I without alignment supervision. We instead rely on image-level supervision, which only enumerates the interactions present in the image without indicating where they happen. Our paper makes three contributions: i) We propose Align-Former, a visual-transformer based CNN that can detect HO-I with only image-level supervision. ii) Align-Former is equipped with an HO-I align layer that can learn to select appropriate targets to allow detector supervision. iii) We evaluate Align-Former on HICO-DET [5] and V-COCO [13], and show that Align-Former outperforms existing image-level supervised HO-I detectors by a large margin (a 4.71% mAP improvement, from 16.14% to 20.85%, on HICO-DET [5]).
Document type	Conference contribution
Note	With supplementary file
Language	English
Other links	https://dblp.org/db/conf/bmvc/bmvc2021.html https://www.bmvc2021-virtualconference.com/programme/accepted-papers/
Downloads	0054 (Final published version)
Supplementary materials	0054_supp