Multimodal learning under visually challenging conditions
| Authors | |
|---|---|
| Supervisors | |
| Cosupervisors | |
| Award date | 19-11-2024 |
| Number of pages | 178 |
| Organisations | |
| Abstract | While many multimodal machine learning methods achieve higher accuracy than single-sense, unimodal approaches, they implicitly assume that the visual modality is always clear. This assumption is easy to falsify: poor visual conditions are common in everyday practice. We find that under challenging visual conditions, existing machine learning methods often fail to effectively represent the information from the other modalities. They rely too heavily on the visual modality, because it is usually reliable and informative in the training data, and consequently cannot adapt when visual conditions deteriorate and begin to carry misleading information. Moreover, such multimodal models have never learned to find cross-modal correspondences in visually challenging scenarios. This thesis studies multimodal learning under visually challenging conditions. We investigate each type of visual degradation in a separate chapter, together with our solutions for more effective multimodal representation learning, and provide a brief conclusion in the final chapter. We hope this journey stimulates further research on multimodal learning under visually challenging conditions. |
| Document type | PhD thesis |
| Language | English |
