Advancing vision and language models through commonsense knowledge, efficient adaptation and transparency

Open Access
Award date 03-10-2025
ISBN 9789464739220
Series ILLC Dissertation Series, DS-2025-07
Number of pages 222
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
  • Faculty of Science (FNWI)
Abstract
The area of multimodal learning has seen substantial advances in recent years. However, many important challenges remain, including integrating external factual knowledge into multimodal models, enabling their fast and efficient adaptation to new tasks, reducing negative interference in joint learning, and improving our understanding of the internal mechanisms of multimodal systems. This dissertation investigates these challenges from the perspectives of two modalities, language and vision, as well as their combination.
First, we develop a method that enhances multimodal reasoning by integrating commonsense knowledge into the representations of visual objects in an image. This allows models to infer object functions and contextual relationships, enabling reasoning about complex scenes beyond visual and spatial cues and improving applicability to real-world scenarios.
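As an illustrative sketch only (not the dissertation's actual implementation), one common way to realize this idea is to retrieve an external commonsense embedding for each detected object and fuse it with the object's visual region features; the module and dimensions below are hypothetical.

```python
import torch
import torch.nn as nn

class KnowledgeFusedObjectEncoder(nn.Module):
    """Fuse a retrieved commonsense embedding with a visual region feature.

    Hypothetical sketch: the knowledge vectors stand in for any external
    source of commonsense embeddings (e.g., vectors derived from a
    knowledge graph); this is not an API from the dissertation.
    """

    def __init__(self, visual_dim: int, knowledge_dim: int, out_dim: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(visual_dim + knowledge_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, region_feats: torch.Tensor, knowledge_vecs: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_objects, visual_dim) from an object detector
        # knowledge_vecs: (num_objects, knowledge_dim) retrieved per object label
        return self.fuse(torch.cat([region_feats, knowledge_vecs], dim=-1))


# Toy usage: 5 detected objects, 2048-d visual features, 300-d knowledge vectors.
encoder = KnowledgeFusedObjectEncoder(visual_dim=2048, knowledge_dim=300, out_dim=512)
regions = torch.randn(5, 2048)
knowledge = torch.randn(5, 300)   # e.g., per-label commonsense embeddings
enriched = encoder(regions, knowledge)  # (5, 512) knowledge-aware object tokens
```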
Second, we introduce a parameter-efficient fine-tuning (PEFT) method that selectively tunes a small subset of parameters, reducing computation and mitigating gradient conflicts while preserving pretrained knowledge. Turning to multi-task learning, we further propose a sparse training approach that enables information sharing across tasks while preventing conflicts, achieving strong results on dense vision prediction tasks.
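A minimal sketch of the selective-tuning idea, under assumed details: freeze all parameters, then unfreeze only the highest-scoring parameter tensors. The importance scores here (plain magnitude) and the function name are assumptions for illustration, not the thesis's scoring rule.

```python
import torch

def select_trainable_subset(model: torch.nn.Module, scores: dict, keep_ratio: float = 0.01):
    """Freeze everything, then mark only the highest-scoring parameter
    tensors as trainable.

    Hypothetical sketch of selective PEFT: `scores` maps parameter names
    to importance values (e.g., accumulated gradient magnitudes on the
    target task).
    """
    for p in model.parameters():
        p.requires_grad_(False)
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = max(1, int(keep_ratio * len(ranked)))
    params = dict(model.named_parameters())
    for name in ranked[:n_keep]:
        params[name].requires_grad_(True)
    return [params[name] for name in ranked[:n_keep]]  # hand these to the optimizer


# Toy usage with a tiny model; scores here are just mean parameter magnitudes.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))
scores = {n: p.abs().mean().item() for n, p in model.named_parameters()}
trainable = select_trainable_subset(model, scores, keep_ratio=0.25)
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```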
Finally, we investigate the internal mechanisms of multimodal large language models (MLLMs), revealing a layered process of modality interaction: early transfer of global visual features to text tokens, mid-layer injection of object-level information, and final-stage fusion for prediction. This provides new insights into the spatial and functional dynamics of multimodal integration.
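One simple probe consistent with this kind of analysis, sketched under assumptions: track how much attention mass text tokens place on image tokens at each layer. The function below is hypothetical; it only assumes per-layer attention tensors of shape (heads, seq, seq), as many transformer implementations expose, and the thesis's actual analysis may differ.

```python
import torch

def text_to_image_attention_by_layer(attentions, image_positions, text_positions):
    """Summarize how much text tokens attend to image tokens per layer.

    Hypothetical probe: a rising-then-falling curve over layers would be
    consistent with the staged interaction described above.
    """
    fractions = []
    for layer_attn in attentions:
        attn = layer_attn.mean(dim=0)                    # average heads: (seq, seq)
        rows = attn[text_positions]                      # attention rows of text tokens
        to_image = rows[:, image_positions].sum(dim=-1)  # mass placed on image tokens
        fractions.append(to_image.mean().item())
    return fractions  # one value per layer


# Toy usage: 4 layers, 2 heads, sequence = 6 image tokens followed by 4 text tokens.
attns = [torch.softmax(torch.randn(2, 10, 10), dim=-1) for _ in range(4)]
img_pos = torch.arange(0, 6)
txt_pos = torch.arange(6, 10)
print(text_to_image_attention_by_layer(attns, img_pos, txt_pos))
```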
In summary, this dissertation advances vision–language models by making them more knowledge-aware, efficient, and interpretable, laying the groundwork for more capable multimodal AI in real-world applications.
Document type PhD thesis
Language English