Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks
| Authors | Letitia Parcalabescu, Albert Gatt, Anette Frank, Iacer Calixto |
|---|---|
| Publication date | 2021 |
| Book title | Multimodal Semantic Representations |
| Book subtitle | Proceedings of the First Workshop : IWCS : June 16, 2021 |
| Event | 1st Workshop on Multimodal Semantic Representations |
| Pages (from-to) | 32-44 |
| Publisher | Stroudsburg, PA: The Association for Computational Linguistics |
| Abstract | We investigate the reasoning ability of pretrained vision and language (V&L) models in two tasks that require multimodal integration: (1) discriminating a correct image-sentence pair from an incorrect one, and (2) counting entities in an image. We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT, in zero-shot and finetuned settings. Our results show that models solve task (1) very well, as expected, since all models are pretrained on task (1). However, none of the pretrained V&L models is able to adequately solve task (2), our counting probe, and they cannot generalise to out-of-distribution quantities. We propose a number of explanations for these findings: LXMERT (and to some extent ViLBERT 12-in-1) show some evidence of catastrophic forgetting on task (1). Concerning our results on the counting probe, we find evidence that all models are impacted by dataset bias, and also fail to individuate entities in the visual input. While a selling point of pretrained V&L models is their ability to solve complex tasks, our findings suggest that understanding their reasoning and grounding capabilities requires more targeted investigations on specific phenomena. |
| Document type | Conference contribution |
| Language | English |
| Published at (DOI) | https://doi.org/10.48550/arXiv.2012.12352 |
| Published at (ACL Anthology) | https://aclanthology.org/2021.mmsr-1.4 |
| Downloads | 2021.mmsr-1.4 (Final published version) |