Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks

22 December 2020
Letitia Parcalabescu, Albert Gatt, Anette Frank, Iacer Calixto

Papers citing "Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks"

22 citing papers shown.
1. What's Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning
   Zhaotian Weng, Haoxuan Li, Kuan-Hao Huang, Jieyu Zhao. 01 Jun 2025.

2. VAQUUM: Are Vague Quantifiers Grounded in Visual Data? (ACL 2025)
   Hugh Mee Wong, Rick Nouwen, Albert Gatt. 17 Feb 2025.

3. CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding
   Ivana Beňová, Michal Gregor, Albert Gatt. 02 Sep 2024.

4. What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation
   Michal Golovanevsky, William Rudman, Vedant Palit, Ritambhara Singh, Carsten Eickhoff. 24 Jun 2024.

5. ColorFoil: Investigating Color Blindness in Large Vision and Language Models
   Ahnaf Mozib Samin, M. F. Ahmed, Md. Mushtaq Shahriyar Rafee. 19 May 2024.

6. Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking
   Ivana Beňová, Jana Kosecka, Michal Gregor, Martin Tamajka, Marcel Veselý, Marian Simko. 29 Jan 2024.

7. The Role of Linguistic Priors in Measuring Compositional Generalization of Vision-Language Models
   Chenwei Wu, Erran L. Li, Stefano Ermon, Patrick Haffner, Rong Ge, Zaiwei Zhang. 04 Oct 2023.

8. Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?
   Haiwei Yang, Liang Ding, Jun Rao, Ye Liu, Li Shen, Changxing Ding. 24 Aug 2023.

9. Controlling for Stereotypes in Multimodal Language Model Evaluation (BlackboxNLP 2023)
   Manuj Malik, Richard Johansson. 03 Feb 2023.

10. MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks (ACL 2022)
    Letitia Parcalabescu, Anette Frank. 15 Dec 2022.

11. Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies? (EMNLP 2022)
    Mitja Nikolaus, Emmanuelle Salin, Stéphane Ayache, Abdellah Fourtassi, Benoit Favre. 21 Oct 2022.

12. Probing Cross-modal Semantics Alignment Capability from the Textual Perspective (EMNLP 2022)
    Zheng Ma, Shi Zong, Mianzhi Pan, Jianbing Zhang, Shujian Huang, Xinyu Dai, Jiajun Chen. 18 Oct 2022.

13. When and why vision-language models behave like bags-of-words, and what to do about it? (ICLR 2022)
    Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, James Zou. 04 Oct 2022.

14. Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality (CVPR 2022)
    Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, Candace Ross. 07 Apr 2022.

15. On Explaining Multimodal Hateful Meme Detection Models (WWW 2022)
    Ming Shan Hee, Roy Ka-wei Lee, Wen-Haw Chong. 04 Apr 2022.

16. Finding Structural Knowledge in Multimodal-BERT (ACL 2022)
    Victor Milewski, Miryam de Lhoneux, Marie-Francine Moens. 17 Mar 2022.

17. DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations (AIES 2022)
    Yiwei Lyu, Paul Pu Liang, Zihao Deng, Ruslan Salakhutdinov, Louis-Philippe Morency. 03 Mar 2022.

18. VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
    Letitia Parcalabescu, Michele Cafagna, Lilitta Muradjan, Anette Frank, Iacer Calixto, Albert Gatt. 14 Dec 2021.

19. TraVLR: Now You See It, Now You Don't! A Bimodal Dataset for Evaluating Visio-Linguistic Reasoning (EACL 2021)
    Keng Ji Chow, Samson Tan, MingSung Kan. 21 Nov 2021.

20. Recent Advances of Continual Learning in Computer Vision: An Overview (IET Computer Vision, 2021)
    Haoxuan Qu, Hossein Rahmani, Kepeng Xu, Bryan M. Williams, Jun Liu. 23 Sep 2021.

21. What Vision-Language Models 'See' when they See Scenes
    Michele Cafagna, Kees van Deemter, Albert Gatt. 15 Sep 2021.

22. Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers (EMNLP 2021)
    Stella Frank, Emanuele Bugliarello, Desmond Elliott. 09 Sep 2021.