Advancing vision and language models through commonsense knowledge, efficient adaptation and transparency

Open Access
Award date 03-10-2025
ISBN 9789464739220
Series ILLC Dissertation Series, DS-2025-07
Number of pages 222
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
  • Faculty of Science (FNWI)
Abstract
The area of multimodal learning has seen substantial advances in recent years. However, many important challenges remain, including integrating external factual knowledge into multimodal models, enabling their fast and efficient adaptation to new tasks, reducing negative interference in joint learning, and improving our understanding of the internal mechanisms of multimodal systems. This dissertation investigates these challenges from the perspectives of two modalities, language and vision, as well as their combination.
First, we develop a method that enhances multimodal reasoning by integrating commonsense knowledge into the representations of visual objects in an image. This allows models to infer object functions and contextual relationships, enabling reasoning about complex scenes beyond visual and spatial cues and improving applicability to real-world scenarios.
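As an illustrative sketch only (not the dissertation's actual implementation), one common way to realize this idea is to retrieve an external commonsense embedding for each detected object and fuse it with the object's visual region features; the module and dimensions below are hypothetical.

```python
import torch
import torch.nn as nn

class KnowledgeFusedObjectEncoder(nn.Module):
    """Fuse a retrieved commonsense embedding with a visual region feature.

    Hypothetical sketch: the knowledge vectors stand in for any external
    source of commonsense embeddings (e.g., vectors derived from a
    knowledge graph); this is not an API from the dissertation.
    """

    def __init__(self, visual_dim: int, knowledge_dim: int, out_dim: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(visual_dim + knowledge_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, region_feats: torch.Tensor, knowledge_vecs: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_objects, visual_dim) from an object detector
        # knowledge_vecs: (num_objects, knowledge_dim) retrieved per object label
        return self.fuse(torch.cat([region_feats, knowledge_vecs], dim=-1))


# Toy usage: 5 detected objects, 2048-d visual features, 300-d knowledge vectors.
encoder = KnowledgeFusedObjectEncoder(visual_dim=2048, knowledge_dim=300, out_dim=512)
regions = torch.randn(5, 2048)
knowledge = torch.randn(5, 300)   # e.g., per-label commonsense embeddings
enriched = encoder(regions, knowledge)  # (5, 512) knowledge-aware object tokens
```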
Second, we introduce a parameter-efficient fine-tuning (PEFT) method that selectively tunes a small subset of parameters, reducing computation and mitigating gradient conflicts while preserving pretrained knowledge. Turning to multi-task learning, we further propose a sparse training approach that enables information sharing across tasks while preventing conflicts, achieving strong results on dense vision prediction tasks.
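A minimal sketch of the selective-tuning idea, under assumed details: freeze all parameters, then unfreeze only the highest-scoring parameter tensors. The importance scores here (plain magnitude) and the function name are assumptions for illustration, not the thesis's scoring rule.

```python
import torch

def select_trainable_subset(model: torch.nn.Module, scores: dict, keep_ratio: float = 0.01):
    """Freeze everything, then mark only the highest-scoring parameter
    tensors as trainable.

    Hypothetical sketch of selective PEFT: `scores` maps parameter names
    to importance values (e.g., accumulated gradient magnitudes on the
    target task).
    """
    for p in model.parameters():
        p.requires_grad_(False)
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = max(1, int(keep_ratio * len(ranked)))
    params = dict(model.named_parameters())
    for name in ranked[:n_keep]:
        params[name].requires_grad_(True)
    return [params[name] for name in ranked[:n_keep]]  # hand these to the optimizer


# Toy usage with a tiny model; scores here are just mean parameter magnitudes.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))
scores = {n: p.abs().mean().item() for n, p in model.named_parameters()}
trainable = select_trainable_subset(model, scores, keep_ratio=0.25)
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```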
Finally, we investigate the internal mechanisms of multimodal large language models (MLLMs), revealing a layered process of modality interaction: early transfer of global visual features to text tokens, mid-layer injection of object-level information, and final-stage fusion for prediction. This provides new insights into the spatial and functional dynamics of multimodal integration.
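One simple probe consistent with this kind of analysis, sketched under assumptions: track how much attention mass text tokens place on image tokens at each layer. The function below is hypothetical; it only assumes per-layer attention tensors of shape (heads, seq, seq), as many transformer implementations expose, and the thesis's actual analysis may differ.

```python
import torch

def text_to_image_attention_by_layer(attentions, image_positions, text_positions):
    """Summarize how much text tokens attend to image tokens per layer.

    Hypothetical probe: a rising-then-falling curve over layers would be
    consistent with the staged interaction described above.
    """
    fractions = []
    for layer_attn in attentions:
        attn = layer_attn.mean(dim=0)                    # average heads: (seq, seq)
        rows = attn[text_positions]                      # attention rows of text tokens
        to_image = rows[:, image_positions].sum(dim=-1)  # mass placed on image tokens
        fractions.append(to_image.mean().item())
    return fractions  # one value per layer


# Toy usage: 4 layers, 2 heads, sequence = 6 image tokens followed by 4 text tokens.
attns = [torch.softmax(torch.randn(2, 10, 10), dim=-1) for _ in range(4)]
img_pos = torch.arange(0, 6)
txt_pos = torch.arange(6, 10)
print(text_to_image_attention_by_layer(attns, img_pos, txt_pos))
```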
In summary, this dissertation advances vision–language models by making them more knowledge-aware, efficient, and interpretable, laying the groundwork for more capable multimodal AI in real-world applications.
Document type PhD thesis
Language English