Visual understanding of dynamic scenes using object relationships and open vocabularies
| Authors | |
|---|---|
| Supervisors | |
| Co-supervisors | |
| Award date | 13-01-2026 |
| Number of pages | 134 |
| Organisations | |
| Abstract | Humans have a robust and flexible understanding of perceived objects, even under occlusion, unfamiliar appearance, or partial information. People also use multiple, context-dependent names for the same item. Conventional computer vision models typically operate in more constrained settings: they assume fixed vocabularies, a predetermined number of objects or detection slots, and large quantities of annotated training data. These assumptions limit their ability to recognize (novel) categories under dynamic circumstances, both visually and linguistically. Annotation noise and semantic ambiguity in large-scale datasets further degrade performance and complicate evaluation in fine-grained or open-vocabulary scenarios. Together, these factors create a substantial gap between human visual understanding and the capabilities of today's models: humans learn from few examples and generalize across appearances and labels with the help of context, while models remain dependent on their training taxonomies and supervision parameters.<br><br>This thesis investigates methods to narrow that gap by increasing the adaptability, scalability, and semantic flexibility of perception systems in dynamic environments. It explores graph-based and multimodal approaches for edge prediction, detection, and semantic or instance segmentation in images, videos, and 3D LiDAR point clouds, aiming for systems that better accommodate novel categories, leverage contextual cues, and remain robust under real-world variability. Ultimately, the goal is to move beyond closed-world benchmarks towards perception models that operate under fewer assumptions and align more closely with the richness of human visual understanding. |
| Document type | PhD thesis |
| Language | English |
| Downloads | Thesis (complete) (Embargo up to 2027-01-13)<br>Chapter 6: Vocabulary-free online video instance segmentation (Embargo up to 2027-01-13) |
| Permalink to this page | |
