Learning from context with multimodal foundation models

Open Access
Award date 21-11-2025
ISBN
  • 9789464964806
Number of pages 127
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
This thesis investigates how multimodal foundation models can learn from context to enhance understanding, generation, and alignment across vision and language. By leveraging contextual cues across modalities, we introduce methods that improve adaptability and performance in diverse multimodal settings. In the first chapter, we address learning from few-shot examples with frozen vision and language backbones. We introduce a meta-learning framework that bridges the vision and language domains to enable fast adaptation and knowledge transfer across multimodal few-shot tasks. The second chapter focuses on in-context image generation and presents Context Diffusion, a diffusion-based framework that learns directly from visual examples provided in context. Unlike prior approaches that rely heavily on textual prompts, Context Diffusion generates high-quality, contextually faithful images given visual, textual, or combined inputs. In the third chapter, we study contrastive vision-language models such as CLIP and their reliance on a fixed context length. We propose TULIP, a method that incorporates relative position encodings and distills knowledge from the original CLIP text encoder to enable processing of captions of arbitrary length. This leads to significant improvements in long-caption retrieval and image generation tasks. The final chapter explores long-caption generation, focusing on the generation of medical imaging reports. We introduce variational topic inference, a framework that captures sentence-level topic diversity and produces coherent, contextually grounded reports aligned with image semantics. Together, these contributions advance learning from context, enabling multimodal foundation models to better understand, generate, and communicate across modalities.
Document type PhD thesis
Language English