Cross-modal dynamic convolution for multi-modal emotion recognition

Authors
Publication date 07-2021
Journal Journal of Visual Communication and Image Representation
Article number 103178
Volume 78
Number of pages 10
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Understanding human emotions requires information from different modalities, such as the vocal, visual, and verbal. Since human emotion is time-varying, this information is usually represented as temporal sequences, and we need to identify both the emotion-related cues within them and their cross-modal interactions. However, emotion-related cues are sparse and misaligned in temporally unaligned sequences, making it hard for previous multi-modal emotion recognition methods to capture helpful cross-modal interactions. To this end, we present cross-modal dynamic convolution. To cope with sparsity, cross-modal dynamic convolution models the temporal dimension locally, avoiding being overwhelmed by unrelated information. It is also easy to stack, which enables it to model long-range cross-modal temporal interactions. Moreover, models built with cross-modal dynamic convolution train more stably than those with cross-modal attention, opening more possibilities for multi-modal sequence model design. Extensive experiments show that our method achieves performance competitive with previous works while being more efficient.
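The paper's implementation is not reproduced on this page, but the abstract's description suggests a layer in the spirit of dynamic convolution: convolution kernels predicted per position from one modality and applied locally along the temporal axis of another. The PyTorch sketch below is a minimal illustration under assumptions, not the authors' method: the class name `CrossModalDynamicConv`, the softmax-normalised depthwise kernels, the head count, and the equal-length sequences are all assumptions made here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalDynamicConv(nn.Module):
    """Sketch: depthwise temporal convolution over a target modality,
    with kernels predicted per time step from a source modality."""

    def __init__(self, dim: int, kernel_size: int = 3, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.kernel_size = kernel_size
        self.num_heads = num_heads
        # One kernel of length `kernel_size` per head and time step,
        # conditioned on the source modality.
        self.kernel_predictor = nn.Linear(dim, num_heads * kernel_size)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target, source: (batch, time, dim); assumed equal in length here.
        B, T, D = target.shape
        H, K = self.num_heads, self.kernel_size

        # Predict position-wise kernels from the source modality and
        # normalise them over the K taps.
        kernels = F.softmax(self.kernel_predictor(source).view(B, T, H, K), dim=-1)

        # Unfold the target into local windows of length K; the locality
        # keeps sparse emotion cues from being drowned out by global context.
        pad_left = (K - 1) // 2
        x = F.pad(target.transpose(1, 2), (pad_left, K - 1 - pad_left))
        windows = x.unfold(-1, K, 1)                  # (B, D, T, K)
        windows = windows.reshape(B, H, D // H, T, K)

        # Weighted sum over each window with the source-predicted kernels.
        out = torch.einsum('bhdtk,bthk->bthd', windows, kernels)
        return out.reshape(B, T, D)


# Toy usage: text-conditioned kernels applied to an audio sequence.
layer = CrossModalDynamicConv(dim=64, kernel_size=3, num_heads=4)
audio = torch.randn(2, 50, 64)   # target modality
text = torch.randn(2, 50, 64)    # source modality
print(layer(audio, text).shape)  # torch.Size([2, 50, 64])
```

Because each output position depends only on a K-sized window, such layers stack naturally: depth grows the cross-modal receptive field, which matches the abstract's point about modeling long-range interactions by stacking.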
Document type Article
Language English
Published at https://doi.org/10.1016/j.jvcir.2021.103178