Visual understanding of dynamic scenes using object relationships and open vocabularies
| Authors | |
|---|---|
| Supervisors | |
| Co-supervisors | |
| Award date | 13-01-2026 |
| Number of pages | 134 |
| Organisations | |
| Abstract | Humans have a robust and flexible understanding of perceived objects, even under occlusion, unfamiliar appearance, or partial information. People also use multiple, context-dependent names for the same item. Conventional computer vision models typically operate in more constrained settings: they assume fixed vocabularies, a predetermined number of objects or detection slots, and large quantities of annotated training data. These assumptions limit their ability to recognize (novel) categories under dynamic circumstances, both visually and linguistically. Annotation noise and semantic ambiguity in large-scale datasets further degrade performance and complicate evaluation in fine-grained or open-vocabulary scenarios. Together, these factors create a substantial gap between human visual understanding and the capabilities of today's models: humans learn from few examples and generalize across appearances and labels with the help of context, while models remain dependent on their training taxonomies and supervision parameters.<br><br>This thesis investigates methods to narrow that gap by increasing the adaptability, scalability, and semantic flexibility of perception systems in dynamic environments. It explores graph-based and multimodal approaches for edge prediction, detection, and semantic or instance segmentation in images, videos, and 3D LiDAR point clouds, aiming for systems that better accommodate novel categories, leverage contextual cues, and remain robust under real-world variability. Ultimately, the goal is to move beyond closed-world benchmarks towards perception models that operate under fewer assumptions and align more closely with the richness of human visual understanding. |
| Document type | PhD thesis |
| Language | English |
| Downloads | Thesis (complete) (Embargo up to 2027-01-13)<br>Chapter 6: Vocabulary-free online video instance segmentation (Embargo up to 2027-01-13) |
| Permalink to this page | |
