Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks
| Authors | Letitia Parcalabescu, Albert Gatt, Anette Frank, Iacer Calixto |
|---|---|
| Publication date | 2021 |
| Book title | Multimodal Semantic Representations |
| Book subtitle | Proceedings of the First Workshop : IWCS : June 16, 2021 |
| Event | 1st Workshop on Multimodal Semantic Representations |
| Pages (from-to) | 32-44 |
| Publisher | Stroudsburg, PA: The Association for Computational Linguistics |
| Abstract | We investigate the reasoning ability of pretrained vision and language (V&L) models in two tasks that require multimodal integration: (1) discriminating a correct image-sentence pair from an incorrect one, and (2) counting entities in an image. We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT, in zero-shot and finetuned settings. Our results show that models solve task (1) very well, as expected, since all models are pretrained on task (1). However, none of the pretrained V&L models is able to adequately solve task (2), our counting probe, and they cannot generalise to out-of-distribution quantities. We propose a number of explanations for these findings: LXMERT (and to some extent ViLBERT 12-in-1) show some evidence of catastrophic forgetting on task (1). Concerning our results on the counting probe, we find evidence that all models are impacted by dataset bias, and also fail to individuate entities in the visual input. While a selling point of pretrained V&L models is their ability to solve complex tasks, our findings suggest that understanding their reasoning and grounding capabilities requires more targeted investigations on specific phenomena. |
| Document type | Conference contribution |
| Language | English |
| Published at (DOI) | https://doi.org/10.48550/arXiv.2012.12352 |
| Published at (ACL Anthology) | https://aclanthology.org/2021.mmsr-1.4 |
| Downloads | 2021.mmsr-1.4 (Final published version) |