Learning from context with multimodal foundation models

I. Najdenkoska

Authors	I. Najdenkoska
Supervisors	M. Worring
Cosupervisors	Y. Asano
Award date	21-11-2025
ISBN	9789464964806
Number of pages	127
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	This thesis investigates how multimodal foundation models can learn from context to enhance understanding, generation, and alignment across vision and language. By leveraging contextual cues across modalities, we introduce methods that improve adaptability and performance in diverse multimodal settings. In the first chapter, we address few-shot learning with frozen vision and language backbones. We introduce a meta-learning framework that bridges the vision and language domains to enable fast adaptation and knowledge transfer across multimodal few-shot tasks. The second chapter focuses on in-context image generation and presents Context Diffusion, a diffusion-based framework that learns directly from visual examples provided in context. Unlike prior approaches that rely heavily on textual prompts, Context Diffusion generates high-quality, contextually faithful images given visual, textual, or combined inputs. In the third chapter, we study contrastive vision-language models such as CLIP and their reliance on a fixed context length. We propose TULIP, a method that incorporates relative position encodings and distills knowledge from the original CLIP text encoder to enable processing of captions of arbitrary length. This leads to significant improvements in long-caption retrieval and image generation tasks. Finally, the last chapter explores long-caption generation, focusing on medical imaging reports. We introduce variational topic inference, a framework that captures sentence-level topic diversity and produces coherent, contextually grounded reports aligned with image semantics. Together, these contributions advance learning from context, enabling multimodal foundation models to better understand, generate, and communicate across modalities.
Document type	PhD thesis
Language	English

UvA-DARE

Digital Academic Repository
