Adding object detection skills to visual dialogue agents

Open Access
Authors
  • G. Bani
  • D. Belli
  • G. Dagan
  • A. Geenen
Publication date 2019
Host editors
  • L. Leal-Taixé
  • S. Roth
Book title Computer Vision – ECCV 2018 Workshops
Book subtitle Munich, Germany, September 8-14, 2018 : proceedings
ISBN
  • 9783030110178
ISBN (electronic)
  • 9783030110185
Series Lecture Notes in Computer Science
Event 15th European Conference on Computer Vision, Workshops
Volume | Issue number IV
Pages (from-to) 180-187
Publisher Cham: Springer
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract
Our goal is to equip a dialogue agent that asks questions about a visual scene with object detection skills. We take the first steps in this direction within the GuessWhat?! game. We use Mask R-CNN object features as a replacement for ground-truth annotations in the Guesser module, achieving an accuracy of 57.92%. This shows that our system is a viable alternative to the original Guesser, which achieves an accuracy of 62.77% using ground-truth annotations and should therefore be considered an upper bound for our automated system. Crucially, we show that our system exploits the Mask R-CNN object features, in contrast to the original Guesser augmented with global VGG features. Furthermore, by automating object detection in GuessWhat?!, we open up a spectrum of opportunities, such as playing the game with new, non-annotated images and using the more granular visual features to condition the other modules of the game architecture.
Document type Conference contribution
Language English
Published at https://doi.org/10.1007/978-3-030-11018-5_17
Published at https://staff.fnwi.uva.nl/r.fernandezrovira/papers/2018/BaniEtal-sivl2018.pdf
Downloads
BaniEtal-sivl2018 (Accepted author manuscript)