GPT2MVS: Generative Pre-Trained Transformer-2 for Multi-Modal Video Summarization
| Authors | |
|---|---|
| Publication date | 2021 |
| Book title | ICMR '21 |
| Book subtitle | Proceedings of the 2021 International Conference on Multimedia Retrieval: August 21-24, 2021, Taipei, Taiwan |
| ISBN (electronic) | |
| Event | 11th ACM International Conference on Multimedia Retrieval, ICMR 2021 |
| Pages (from-to) | 580-589 |
| Publisher | New York, NY: The Association for Computing Machinery |
| Organisations | |
| Abstract | Traditional video summarization methods generate fixed video representations regardless of user interest, and therefore fall short of users' expectations in content search and exploration scenarios. Multi-modal video summarization is one method used to address this problem. When multi-modal video summarization is used to support video exploration, a user-defined text-based query serves as one of the main drivers of video summary generation. Effectively encoding both the text-based query and the video is therefore important for this task. In this work, a new method is proposed that uses a specialized attention network and contextualized word representations to tackle it. The proposed model consists of a contextualized video summary controller, multi-modal attention mechanisms, an interactive attention network, and a video summary generator. Evaluated on the existing multi-modal video summarization benchmark, the proposed model achieves a +5.88% gain in accuracy and a +4.06% gain in F1-score compared with the state-of-the-art method. https://github.com/Jhhuangkay/GPT2MVS-Generative-Pre-trained-Transformer-2-for-Multi-modal-Video-Summarization |
| Document type | Conference contribution |
| Language | English |
| Published at | https://doi.org/10.1145/3460426.3463662 |
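As a rough illustration of the query-conditioned summarization idea described in the abstract, the sketch below encodes the text-based query with GPT-2's contextualized word representations and lets frame features attend to the query tokens before scoring each frame. This is a minimal sketch, not the authors' implementation; the class, dimensions, and variable names are hypothetical, and the actual GPT2MVS architecture is described in the paper and repository above.

```python
# Hypothetical sketch of a query-conditioned frame scorer: GPT-2 provides
# contextualized embeddings of the text query; a cross-attention layer fuses
# them with visual frame features; a linear head scores each frame.
import torch
import torch.nn as nn
from transformers import GPT2Tokenizer, GPT2Model


class QueryConditionedSummarizer(nn.Module):  # hypothetical class name
    def __init__(self, frame_dim=512, hidden_dim=768):
        super().__init__()
        # GPT-2 supplies contextualized representations of the text query.
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        self.frame_proj = nn.Linear(frame_dim, hidden_dim)
        # Cross-attention: frame features attend to query tokens.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8,
                                                batch_first=True)
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, frame_feats, query_ids, query_mask):
        # frame_feats: (B, T, frame_dim) precomputed visual features
        # query_ids / query_mask: (B, L) tokenized text query
        q = self.gpt2(input_ids=query_ids,
                      attention_mask=query_mask).last_hidden_state
        v = self.frame_proj(frame_feats)                   # (B, T, hidden)
        fused, _ = self.cross_attn(query=v, key=q, value=q,
                                   key_padding_mask=~query_mask.bool())
        # One importance score per frame; top-scoring frames form the summary.
        return self.scorer(fused).squeeze(-1)              # (B, T)


tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
enc = tokenizer(["a man playing guitar on stage"],
                return_tensors="pt", padding=True)
model = QueryConditionedSummarizer()
frames = torch.randn(1, 120, 512)        # e.g. 120 frames of 512-d features
scores = model(frames, enc["input_ids"], enc["attention_mask"])
summary_idx = scores.topk(k=10, dim=1).indices  # pick 10 frames as the summary
```

The design choice sketched here, scoring frames against the query rather than producing a fixed summary, reflects the paper's motivation: the same video yields different summaries for different user-defined queries.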
