ResearchTrend.AI
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs

Transactions of the Association for Computational Linguistics (TACL), 2021
30 November 2020
Emanuele Bugliarello
Ryan Cotterell
Naoaki Okazaki
Desmond Elliott

Papers citing "Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs"

50 / 69 papers shown
Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA
Neural Information Processing Systems (NeurIPS), 2024
Lifeng Qiao
Peng Ye
Yuchen Ren
Weiqiang Bai
Chaoqi Liang
Cheng Wang
Nanqing Dong
W. Ouyang
366
16
0
18 Dec 2024
Do Language Models Understand Time?
The Web Conference (WWW), 2024
Xi Ding
Lei Wang
1.0K
13
0
18 Dec 2024
Unified Framework for Open-World Compositional Zero-shot Learning
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Hirunima Jayasekara
Khoi Pham
Nirat Saini
Abhinav Shrivastava
386
1
0
05 Dec 2024
Renaissance: Investigating the Pretraining of Vision-Language Encoders
Clayton Fields
C. Kennington
VLM
221
1
0
11 Nov 2024
VISTA: A Visual and Textual Attention Dataset for Interpreting Multimodal Models
Harshit
Tolga Tasdizen
CoGeVLM
200
2
0
06 Oct 2024
Why context matters in VQA and Reasoning: Semantic interventions for VLM input modalities
Kenza Amara
Lukas Klein
Carsten T. Lüth
Paul Jäger
Hendrik Strobelt
Mennatallah El-Assady
228
3
0
02 Oct 2024
ComAlign: Compositional Alignment in Vision-Language Models
Ali Abdollah
Amirmohammad Izadi
Armin Saghafian
Reza Vahidimajd
Mohammad Mozafari
Amirreza Mirzaei
Mohammadmahdi Samiei
M. Baghshah
CoGeVLM
296
1
0
12 Sep 2024
CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding
Ivana Beňová
Michal Gregor
Albert Gatt
433
1
0
02 Sep 2024
BrewCLIP: A Bifurcated Representation Learning Framework for Audio-Visual Retrieval
Zhenyu Lu
Lakshay Sethi
238
0
0
19 Aug 2024
MuTT: A Multimodal Trajectory Transformer for Robot Skills
Claudius Kienle
Benjamin Alt
Onur Celik
P. Becker
Darko Katic
Rainer Jäkel
Gerhard Neumann
371
3
0
22 Jul 2024
How and where does CLIP process negation?
Vincent Quantmeyer
Pablo Mosteiro
Albert Gatt
CoGe
306
14
0
15 Jul 2024
GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations
Rick Wilming
Artur Dox
Hjalmar Schulz
Marta Oliveira
Benedict Clark
Stefan Haufe
338
6
0
17 Jun 2024
No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models
Angeline Pouget
Lucas Beyer
Emanuele Bugliarello
Xiao Wang
Andreas Steiner
Xiao-Qi Zhai
Ibrahim Alabdulmohsin
VLM
421
15
0
22 May 2024
Acquiring Linguistic Knowledge from Multimodal Input
Theodor Amariucai
Alexander Scott Warstadt
CLL
367
4
0
27 Feb 2024
Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking
Ivana Beňová
Jana Kosecka
Michal Gregor
Martin Tamajka
Marcel Veselý
Marian Simko
240
2
0
29 Jan 2024
GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models
Haicheng Liao
Huanming Shen
Zhenning Li
Chengyue Wang
Guofa Li
Yiming Bie
Chengzhong Xu
310
89
0
06 Dec 2023
Evaluating Bias and Fairness in Gender-Neutral Pretrained Vision-and-Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Laura Cabello
Emanuele Bugliarello
Stephanie Brandl
Desmond Elliott
350
8
0
26 Oct 2023
The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained Multimodal Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Xinyi Chen
Raquel Fernández
Sandro Pezzelle
VLM
263
13
0
23 Oct 2023
On the Language Encoder of Contrastive Cross-modal Models
Mengjie Zhao
Junya Ono
Zhi-Wei Zhong
Chieh-Hsin Lai
Yuhta Takida
Naoki Murata
Wei-Hsiang Liao
Takashi Shibuya
Hiromi Wakaki
Yuki Mitsufuji
VLM
169
2
0
20 Oct 2023
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai-Nguyen Nguyen
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
411
25
0
23 Sep 2023
The Scenario Refiner: Grounding subjects in images at the morphological level
Claudia Tagliaferri
Sofia Axioti
Albert Gatt
Denis Paperno
271
1
0
20 Sep 2023
Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?
Haiwei Yang
Liang Ding
Jun Rao
Ye Liu
Li Shen
Changxing Ding
318
27
0
24 Aug 2023
Generic Attention-model Explainability by Weighted Relevance Accumulation
ACM Multimedia Asia (MA), 2023
Yiming Huang
Ao Jia
Xiaodan Zhang
Jiawei Zhang
176
4
0
20 Aug 2023
Vision Language Transformers: A Survey
Clayton Fields
C. Kennington
VLM
237
8
0
06 Jul 2023
Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input
European Conference on Computer Vision (ECCV), 2023
Qingpei Guo
Kaisheng Yao
Wei Chu
MLLM
124
7
0
25 Jun 2023
Zero-shot Composed Text-Image Retrieval
British Machine Vision Conference (BMVC), 2023
Yikun Liu
Jiangchao Yao
Ya Zhang
Yanfeng Wang
Weidi Xie
315
38
0
12 Jun 2023
Factorized Contrastive Learning: Going Beyond Multi-view Redundancy
Neural Information Processing Systems (NeurIPS), 2023
Paul Pu Liang
Zihao Deng
Martin Q. Ma
James Zou
Louis-Philippe Morency
Ruslan Salakhutdinov
SSL
349
100
0
08 Jun 2023
Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Emanuele Bugliarello
Aida Nematzadeh
Lisa Anne Hendricks
SSL
344
6
0
23 May 2023
Semantic Composition in Visually Grounded Language Models
Rohan Pandey
CoGe
255
1
0
15 May 2023
Measuring Progress in Fine-grained Vision-and-Language Understanding
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Emanuele Bugliarello
Laurent Sartran
Aishwarya Agrawal
Lisa Anne Hendricks
Aida Nematzadeh
VLM
261
31
0
12 May 2023
A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Yunxin Li
Baotian Hu
Xinyu Chen
Yuxin Ding
Lin Ma
Min Zhang
LRM
216
19
0
08 May 2023
Multimodal Understanding Through Correlation Maximization and Minimization
Yi Shi
Marc Niethammer
246
1
0
04 May 2023
3D Feature Prediction for Masked-AutoEncoder-Based Point Cloud Pretraining
International Conference on Learning Representations (ICLR), 2023
Siming Yan
Yu-Qi Yang
Yu-Xiao Guo
Hao Pan
Peng-shuai Wang
Xin Tong
Yang Liu
Qi-Xing Huang
3DPC
285
21
0
14 Apr 2023
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Lucas Beyer
Bo Wan
Gagan Madan
Filip Pavetić
Andreas Steiner
...
Emanuele Bugliarello
Tianlin Li
Qihang Yu
Liang-Chieh Chen
Xiaohua Zhai
311
10
0
30 Mar 2023
A Two-Sided Discussion of Preregistration of NLP Research
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023
Anders Søgaard
Daniel Hershcovich
Miryam de Lhoneux
OnRLAI4CE
239
4
0
20 Feb 2023
BinaryVQA: A Versatile Test Set to Evaluate the Out-of-Distribution Generalization of VQA Models
Ali Borji
CoGe
153
2
0
28 Jan 2023
Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering
European Conference on Information Retrieval (ECIR), 2023
Paul Lerner
O. Ferret
C. Guinaudeau
303
12
0
11 Jan 2023
Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Rohan Pandey
Rulin Shao
Paul Pu Liang
Ruslan Salakhutdinov
Louis-Philippe Morency
266
21
0
20 Dec 2022
Compound Tokens: Channel Fusion for Vision-Language Representation Learning
Maxwell Mbabilla Aladago
A. Piergiovanni
221
3
0
02 Dec 2022
Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions
Michele Cafagna
Kees van Deemter
Albert Gatt
CoGe
200
4
0
09 Nov 2022
Training Vision-Language Models with Less Bimodal Supervision
Conference on Automated Knowledge Base Construction (AKBC), 2022
Elad Segal
Ben Bogin
Jonathan Berant
VLM
153
2
0
01 Nov 2022
Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Tong Wang
Jorma T. Laaksonen
T. Langer
Heikki Arponen
Tom E. Bishop
VLM
208
7
0
24 Oct 2022
Multilingual Multimodal Learning with Machine Translated Text
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Chen Qiu
Dan Oneaţă
Emanuele Bugliarello
Stella Frank
Desmond Elliott
394
19
0
24 Oct 2022
Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Mitja Nikolaus
Emmanuelle Salin
Stéphane Ayache
Abdellah Fourtassi
Benoit Favre
179
17
0
21 Oct 2022
LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Hongcheng Guo
Jiaheng Liu
Haoyang Huang
Jian Yang
Zhoujun Li
Dongdong Zhang
Zheng Cui
Furu Wei
230
25
0
19 Oct 2022
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Workshop on Representation Learning for NLP (RepL4NLP), 2022
Gregor Geigle
Chen Cecilia Liu
Jonas Pfeiffer
Iryna Gurevych
VLM
235
1
0
12 Oct 2022
How to Adapt Pre-trained Vision-and-Language Models to a Text-only Input?
International Conference on Computational Linguistics (COLING), 2022
Lovisa Hagström
Richard Johansson
VLM
172
4
0
19 Sep 2022
FashionViL: Fashion-Focused Vision-and-Language Representation Learning
European Conference on Computer Vision (ECCV), 2022
Xiaoping Han
Licheng Yu
Xiatian Zhu
Li Zhang
Yi-Zhe Song
Tao Xiang
AI4TS
226
63
0
17 Jul 2022
Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization
Findings of the Association for Computational Linguistics, 2022
Aishwarya Agrawal
Ivana Kajić
Emanuele Bugliarello
Elnaz Davoodi
Anita Gergely
Phil Blunsom
Aida Nematzadeh
OOD
281
22
0
24 May 2022
Visual Spatial Reasoning
Transactions of the Association for Computational Linguistics (TACL), 2022
Fangyu Liu
Guy Edward Toh Emerson
Nigel Collier
ReLM
629
301
0
30 Apr 2022