Early Embedding and Late Reranking for Video Captioning

Open Access
Authors
  • J. Dong
  • X. Li
  • W. Lan
  • Y. Huo
Publication date 2016
Book title MM '16: Proceedings of the 2016 ACM Multimedia Conference
ISBN
  • 9781450336031
Event ACM Multimedia Conference
Pages (from-to) 1082-1086
Publisher New York: Association for Computing Machinery
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
This paper describes our solution for the MSR Video to Language Challenge. We start from the popular ConvNet + LSTM model, which we extend with two novel modules. One is early embedding, which enriches the low-level LSTM input with tag embeddings. The other is late reranking, which re-scores generated sentences by their relevance to a specific video. Both modules are inspired by recent work on image captioning, repurposed and redesigned for video. As experiments on the MSR-VTT validation set show, the joint use of the two modules adds a clear improvement over a non-trivial ConvNet + LSTM baseline under four performance metrics. The viability of the proposed solution is further confirmed by the organizers' blind test: our system ranks 4th in overall performance while scoring best on CIDEr-D, which measures the human-likeness of generated captions.
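The abstract only sketches the two modules, so the following is a rough illustration rather than the authors' implementation: early embedding can be read as concatenating a video-tag embedding onto each per-step LSTM input, and late reranking as re-scoring generated candidate captions by their similarity to the video representation. All function names, dimensions, and the cosine-similarity choice below are illustrative assumptions.

```python
import numpy as np

def early_embed(word_vec, tag_vec):
    """Hypothetical early embedding: enrich the per-step LSTM input
    by concatenating a tag-embedding vector onto the word embedding."""
    return np.concatenate([word_vec, tag_vec])

def late_rerank(candidates, video_vec, sent_embed):
    """Hypothetical late reranking: re-score candidate captions by the
    cosine similarity between a sentence embedding and the video feature,
    returning candidates sorted from most to least relevant."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(candidates,
                  key=lambda s: cosine(sent_embed(s), video_vec),
                  reverse=True)

# Toy usage with made-up 3-d features standing in for real embeddings.
video_vec = np.array([1.0, 0.0, 0.0])
embeds = {"a dog runs": np.array([0.9, 0.1, 0.0]),
          "a man sings": np.array([0.0, 1.0, 0.0])}
ranked = late_rerank(list(embeds), video_vec, lambda s: embeds[s])
```

In this toy example the caption whose embedding best matches the video feature is promoted to the top of the list; the paper's actual relevance scoring is not specified in the abstract.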
Document type Conference contribution
Language English
Published at https://doi.org/10.1145/2964284.2984064
Also available at https://ivi.fnwi.uva.nl/isis/publications/2016/DongICMR2016