If I feel smart, I will do the right thing: Combining Complementary Multimodal Information in Visual Language Models

Y. Bai; S. Pezzelle

Authors	Y. Bai; S. Pezzelle
Publication date	2025
Host editors	W.E. Zhang; X. Dai; D. Elliott; B. Fang; M. Sim; H. Zhuang; W. Chen
Book title	The Workshop of Evaluation of Multi-Modal Generation : proceedings of the First Workshop of Evaluation of Multi-Modal Generation
Book subtitle	EvalMG 2025 : January, 2025
ISBN (electronic)	9798891762138
Event	1st Workshop of Evaluation of Multi-Modal Generation
Pages (from-to)	24-39
Number of pages	16
Publisher	Stroudsburg, PA: Association for Computational Linguistics
Organisations	Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract	Generative visual language models (VLMs) have recently shown potential across various downstream language-and-vision tasks. At the same time, it is still an open question whether, and to what extent, these models can properly understand a multimodal context where language and vision provide complementary information—a mechanism routinely in place in human language communication. In this work, we test various VLMs on the task of generating action descriptions consistent with both an image’s visual content and an intention or attitude (not visually grounded) conveyed by a textual prompt. Our results show that BLIP-2 is not far from human performance when the task is framed as a generative multiple-choice problem, while other models struggle. Furthermore, the actions generated by BLIP-2 in an open-ended generative setting are better than those by the competitors; indeed, human annotators judge most of them as plausible continuations for the multimodal context. Our study reveals substantial variability among VLMs in integrating complementary multimodal information, yet BLIP-2 demonstrates promising trends across most evaluations, paving the way for seamless human-computer interaction.
Document type	Conference contribution
Language	English
Published at	https://aclanthology.org/2025.evalmg-1.3/ (Final published version)
Downloads	2025.evalmg-1.3 (Final published version)