Early Embedding and Late Reranking for Video Captioning

Open Access
Authors
  • J. Dong
  • X. Li
  • W. Lan
  • Y. Huo
Publication date 2016
Book title MM '16: Proceedings of the 2016 ACM Multimedia Conference
ISBN
  • 9781450336031
Event ACM Multimedia Conference
Pages (from-to) 1082-1086
Publisher New York: Association for Computing Machinery
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
This paper describes our solution for the MSR Video to Language Challenge. We start from the popular ConvNet + LSTM model, which we extend with two novel modules. One is early embedding, which enriches the low-level LSTM input with tag embeddings. The other is late reranking, which re-scores generated sentences by their relevance to a specific video. Both modules are inspired by recent work on image captioning, repurposed and redesigned for video. As experiments on the MSR-VTT validation set show, the joint use of the two modules adds a clear improvement over a non-trivial ConvNet + LSTM baseline under four performance metrics. The viability of the proposed solution is further confirmed by the organizers' blind test: our system ranks 4th in overall performance while scoring best on CIDEr-D, which measures the human-likeness of generated captions.
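The abstract only sketches the two modules, so the following is a rough illustration rather than the authors' implementation: early embedding can be read as concatenating a video-tag embedding onto each per-step LSTM input, and late reranking as re-scoring generated candidate captions by their similarity to the video representation. All function names, dimensions, and the cosine-similarity choice below are illustrative assumptions.

```python
import numpy as np

def early_embed(word_vec, tag_vec):
    """Hypothetical early embedding: enrich the per-step LSTM input
    by concatenating a tag-embedding vector onto the word embedding."""
    return np.concatenate([word_vec, tag_vec])

def late_rerank(candidates, video_vec, sent_embed):
    """Hypothetical late reranking: re-score candidate captions by the
    cosine similarity between a sentence embedding and the video feature,
    returning candidates sorted from most to least relevant."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(candidates,
                  key=lambda s: cosine(sent_embed(s), video_vec),
                  reverse=True)

# Toy usage with made-up 3-d features standing in for real embeddings.
video_vec = np.array([1.0, 0.0, 0.0])
embeds = {"a dog runs": np.array([0.9, 0.1, 0.0]),
          "a man sings": np.array([0.0, 1.0, 0.0])}
ranked = late_rerank(list(embeds), video_vec, lambda s: embeds[s])
```

In this toy example the caption whose embedding best matches the video feature is promoted to the top of the list; the paper's actual relevance scoring is not specified in the abstract.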
Document type Conference contribution
Language English
Published at https://doi.org/10.1145/2964284.2984064
Also available at https://ivi.fnwi.uva.nl/isis/publications/2016/DongICMR2016