Word Representation Learning in Multimodal Pre-Trained Transformers: An Intrinsic Evaluation
| Authors | |
|---|---|
| Publication date | 30-12-2021 |
| Journal | Transactions of the Association for Computational Linguistics |
| Volume | 9 |
| Issue number | |
| Pages (from-to) | 1563–1579 |
| Number of pages | 17 |
| Organisations | |
| Abstract | This study carries out a systematic intrinsic evaluation of the semantic representations learned by state-of-the-art pre-trained multimodal Transformers. These representations are claimed to be task-agnostic and shown to help on many downstream language-and-vision tasks. However, the extent to which they align with human semantic intuitions remains unclear. We experiment with various models and obtain static word representations from the contextualized ones they learn. We then evaluate them against the semantic judgments provided by human speakers. In line with previous evidence, we observe a generalized advantage of multimodal representations over language-only ones on concrete word pairs, but not on abstract ones. On the one hand, this confirms the effectiveness of these models at aligning language and vision, which results in better semantic representations for concepts that are grounded in images. On the other hand, models are shown to follow different representation learning patterns, which sheds some light on how and when they perform multimodal integration. |
| Document type | Article |
| Language | English |
| Published at | https://doi.org/10.1162/tacl_a_00443 |
| Downloads | tacl_a_00443 (Final published version) |
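
The abstract outlines a two-step pipeline: derive a static vector for each word from a model's contextualized representations, then compare model-derived similarities of word pairs against human semantic judgments. Below is a minimal, illustrative sketch of that pipeline in Python. The model (`bert-base-uncased`, a language-only stand-in), the tiny in-line word-pair dataset, and the `static_embedding` helper are assumptions for demonstration only, not the paper's actual models, benchmarks, or code.

```python
# Sketch: average a word's contextualized embeddings across sentences to
# get a static vector, then correlate cosine similarities of word pairs
# with human similarity ratings via Spearman's rho.
import torch
from scipy.stats import spearmanr
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def static_embedding(word: str, contexts: list[str]) -> torch.Tensor:
    """Average the word's contextualized vectors over the given sentences."""
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    vectors = []
    for sentence in contexts:
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
        ids = enc["input_ids"][0].tolist()
        # Find every occurrence of the word's subword-id sequence and
        # mean-pool the hidden states at those positions.
        for i in range(len(ids) - len(word_ids) + 1):
            if ids[i : i + len(word_ids)] == word_ids:
                vectors.append(hidden[i : i + len(word_ids)].mean(dim=0))
    assert vectors, f"'{word}' not found in its contexts"
    return torch.stack(vectors).mean(dim=0)

# Hypothetical word pairs with human similarity ratings (0-10 scale).
pairs = [("cat", "dog", 7.5), ("car", "bicycle", 5.0), ("cat", "car", 1.5)]
contexts = {
    "cat": ["The cat slept on the sofa.", "A cat chased the mouse."],
    "dog": ["The dog barked loudly.", "A dog fetched the ball."],
    "car": ["The car stopped at the light.", "She parked the car outside."],
    "bicycle": ["He rode his bicycle to work.", "The bicycle had a flat tire."],
}

model_sims, human_sims = [], []
for w1, w2, rating in pairs:
    v1 = static_embedding(w1, contexts[w1])
    v2 = static_embedding(w2, contexts[w2])
    model_sims.append(torch.cosine_similarity(v1, v2, dim=0).item())
    human_sims.append(rating)

rho, _ = spearmanr(model_sims, human_sims)
print(f"Spearman correlation with human judgments: {rho:.3f}")
```

In the paper's setting, the ratings would come from established human-judgment benchmarks rather than a toy list, and the correlation would be examined separately for concrete and abstract word pairs, which is where the reported multimodal advantage emerges.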
