Cross-modal dynamic convolution for multi-modal emotion recognition

Authors
Publication date 07-2021
Journal Journal of Visual Communication and Image Representation
Article number 103178
Volume 78
Number of pages 10
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Understanding human emotions requires information from different modalities, such as the vocal, visual, and verbal. Since human emotion is time-varying, this information is usually represented as temporal sequences, and we need to identify both the emotion-related cues within them and their cross-modal interactions. However, emotion-related cues are sparse and misaligned in temporally unaligned sequences, making it hard for previous multi-modal emotion recognition methods to capture helpful cross-modal interactions. To this end, we present cross-modal dynamic convolution. To cope with sparsity, cross-modal dynamic convolution models the temporal dimension locally, avoiding being overwhelmed by unrelated information. It is also easy to stack, which enables it to model long-range cross-modal temporal interactions. Moreover, models built with cross-modal dynamic convolution train more stably than those with cross-modal attention, opening more possibilities for multi-modal sequence model design. Extensive experiments show that our method achieves performance competitive with previous works while being more efficient.
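The paper's implementation is not reproduced on this page, but the abstract's description suggests a layer in the spirit of dynamic convolution: convolution kernels predicted per position from one modality and applied locally along the temporal axis of another. The PyTorch sketch below is a minimal illustration under assumptions, not the authors' method: the class name `CrossModalDynamicConv`, the softmax-normalised depthwise kernels, the head count, and the equal-length sequences are all assumptions made here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalDynamicConv(nn.Module):
    """Sketch: depthwise temporal convolution over a target modality,
    with kernels predicted per time step from a source modality."""

    def __init__(self, dim: int, kernel_size: int = 3, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.kernel_size = kernel_size
        self.num_heads = num_heads
        # One kernel of length `kernel_size` per head and time step,
        # conditioned on the source modality.
        self.kernel_predictor = nn.Linear(dim, num_heads * kernel_size)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target, source: (batch, time, dim); assumed equal in length here.
        B, T, D = target.shape
        H, K = self.num_heads, self.kernel_size

        # Predict position-wise kernels from the source modality and
        # normalise them over the K taps.
        kernels = F.softmax(self.kernel_predictor(source).view(B, T, H, K), dim=-1)

        # Unfold the target into local windows of length K; the locality
        # keeps sparse emotion cues from being drowned out by global context.
        pad_left = (K - 1) // 2
        x = F.pad(target.transpose(1, 2), (pad_left, K - 1 - pad_left))
        windows = x.unfold(-1, K, 1)                  # (B, D, T, K)
        windows = windows.reshape(B, H, D // H, T, K)

        # Weighted sum over each window with the source-predicted kernels.
        out = torch.einsum('bhdtk,bthk->bthd', windows, kernels)
        return out.reshape(B, T, D)


# Toy usage: text-conditioned kernels applied to an audio sequence.
layer = CrossModalDynamicConv(dim=64, kernel_size=3, num_heads=4)
audio = torch.randn(2, 50, 64)   # target modality
text = torch.randn(2, 50, 64)    # source modality
print(layer(audio, text).shape)  # torch.Size([2, 50, 64])
```

Because each output position depends only on a K-sized window, such layers stack naturally: depth grows the cross-modal receptive field, which matches the abstract's point about modeling long-range interactions by stacking.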
Document type Article
Language English
Published at https://doi.org/10.1016/j.jvcir.2021.103178