ArtRAG: Retrieval-Augmented Generation with Structured Context for Visual Art Understanding

S. Wang; I. Najdenkoska; H. Zhu; S. Rudinac; M. Kackovic; N.M. Wijnberg; M. Worring

doi:https://doi.org/10.1145/3746027.3755673

Authors	S. Wang; I. Najdenkoska; H. Zhu; S. Rudinac; M. Kackovic; N.M. Wijnberg; M. Worring
Publication date	2025
Book title	MM '25
Book subtitle	Proceedings of the 33rd ACM International Conference on Multimedia: October 27-31, 2025, Dublin, Ireland
ISBN (electronic)	9798400720352
Event	33rd ACM International Conference on Multimedia
Pages (from-to)	6700-6709
Publisher	New York, NY: Association for Computing Machinery
Organisations	Faculty of Economics and Business (FEB) - Amsterdam Business School Research Institute (ABS-RI); Faculty of Science (FNWI) - Informatics Institute (IVI); Faculty of Economics and Business (FEB)
Abstract	Visual art understanding requires joint modeling of multiple perspectives and contextual inference rooted in cultural, historical, and stylistic knowledge. Recent multimodal large language models (MLLMs) demonstrate strong performance in generic captioning, which relies primarily on object recognition and training on large-scale generic data. However, they struggle to produce captions that incorporate the multiple perspectives that fine art demands. In this work, we introduce ArtRAG, a novel training-free framework that integrates structured knowledge into a retrieval-augmented generation (RAG) pipeline for multi-perspective artwork explanation. ArtRAG automatically constructs an Art Context Knowledge Graph (ACKG) from domain-specific textual sources, organizing entities such as artists, themes, movements, and historical events into a rich, interpretable knowledge graph. At inference time, a multi-granular structured context retriever selects semantically and topologically relevant subgraphs to guide explanation generation. This approach enables MLLMs to produce contextually grounded, multi-perspective descriptions. Experiments on the SemArt and Artpedia datasets demonstrate that ArtRAG outperforms existing heavily trained baselines. Human evaluations further confirm ArtRAG's ability to generate coherent, informative, and culturally enriched interpretations of artworks.
Document type	Conference contribution
Language	English
Published at	https://doi.org/10.1145/3746027.3755673 (Final published version)
Downloads	3746027.3755673 (Final published version)
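
The abstract describes retrieval that combines semantic and topological relevance over a knowledge graph. The following is a minimal illustrative sketch of that general idea, not the authors' implementation: the toy graph, the bag-of-words cosine similarity (a stand-in for a learned embedding), and the names `GRAPH`, `retrieve_subgraph`, and `build_prompt` are all assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical toy graph: node name -> (short description, neighbour names).
GRAPH = {
    "Claude Monet": ("french painter impressionism landscapes light",
                     ["Impressionism", "Water Lilies"]),
    "Impressionism": ("art movement paris light colour brushwork",
                      ["Claude Monet"]),
    "Water Lilies": ("series paintings pond garden giverny",
                     ["Claude Monet"]),
    "Cubism": ("art movement geometric fragmented forms picasso", []),
}


def bow(text):
    """Bag-of-words vector; a stand-in for a real text or image embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve_subgraph(query, graph, top_k=1, hops=1):
    """Pick seed nodes by similarity, then expand along graph edges."""
    q = bow(query)
    # Semantic step: rank nodes by similarity of name + description to the query.
    ranked = sorted(graph,
                    key=lambda n: cosine(q, bow(n + " " + graph[n][0])),
                    reverse=True)
    selected = set(ranked[:top_k])
    # Topological step: add neighbours up to `hops` edges from the seeds.
    frontier = set(selected)
    for _ in range(hops):
        frontier = {nb for n in frontier
                    for nb in graph[n][1] if nb in graph} - selected
        selected |= frontier
    return selected


def build_prompt(query, graph, nodes):
    """Serialise the retrieved subgraph as context for a generator model."""
    lines = [f"- {n}: {graph[n][0]}" for n in sorted(nodes)]
    return "Context:\n" + "\n".join(lines) + f"\n\nExplain the artwork: {query}"


nodes = retrieve_subgraph("impressionism painter light landscapes", GRAPH)
print(build_prompt("impressionism painter light landscapes", GRAPH, nodes))
```

In this toy setup the query seeds on the closest node and then pulls in its one-hop neighbours, so unrelated nodes are left out of the context passed to the generator.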