Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

Transactions of the Association for Computational Linguistics (TACL), 2021
31 January 2021
Lisa Anne Hendricks
John F. J. Mellor
R. Schneider
Jean-Baptiste Alayrac
Aida Nematzadeh

Papers citing "Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers"

50 / 59 papers shown
Breast Cancer VLMs: Clinically Practical Vision-Language Train-Inference Models
Shunjie-Fabian Zheng
Hyeonjun Lee
Thijs Kooi
Ali Diba
112
0
0
29 Oct 2025
Learning Reconfigurable Representations for Multimodal Federated Learning with Missing Data
D. Nguyen
Trong Nghia Hoang
T. T. Huynh
Quoc Viet Hung Nguyen
Phi Le Nguyen
157
2
0
27 Oct 2025
InfMasking: Unleashing Synergistic Information by Contrastive Multimodal Interactions
Liangjian Wen
Qun Dai
Jianzhuang Liu
Jiangtao Zheng
Yong Dai
Dongkai Wang
Zhao Kang
Jun Wang
Z. Xu
Jiang Duan
327
1
0
28 Sep 2025
Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Xiao Xu
L. Qin
Wanxiang Che
Min-Yen Kan
MoE, VLM
335
1
0
13 Jun 2025
MCFNet: A Multimodal Collaborative Fusion Network for Fine-Grained Semantic Classification
Yang Qiao
Xiaoyu Zhong
Xiaofeng Gu
Zhiguo Yu
267
0
0
29 May 2025
A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Information Fusion (Inf. Fusion), 2024
Hanguang Xiao
Feizhong Zhou
Xianglong Liu
Tianqi Liu
Zhipeng Li
Xin Liu
Xiaoxuan Huang
AILaw, LM&MA, LRM
541
107
0
31 Dec 2024
ChitroJera: A Regionally Relevant Visual Question Answering Dataset for Bangla
Deeparghya Dutta Barua
Md Sakib Ul Rahman Sourove
Md Fahim
Fabiha Haider
Fariha Tanjim Shifat
Md Tasmim Rahman Adib
Anam Borhan Uddin
Md Farhan Ishmam
Md Farhad Alam
279
4
0
19 Oct 2024
Multi-modal Intermediate Feature Interaction AutoEncoder for Overall Survival Prediction of Esophageal Squamous Cell Cancer
IEEE International Symposium on Biomedical Imaging (ISBI), 2024
Chengyu Wu
Yatao Zhang
Yaqi Wang
Qifeng Wang
Shuai Wang
109
1
0
23 Aug 2024
BrewCLIP: A Bifurcated Representation Learning Framework for Audio-Visual Retrieval
Zhenyu Lu
Lakshay Sethi
245
0
0
19 Aug 2024
Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media
Zhizhen Zhang
Ning Wang
Haojie Li
Zhihui Wang
275
1
0
09 May 2024
Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification
Wentao Zhu
355
8
0
08 Jan 2024
Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification
Wentao Zhu
222
5
0
08 Jan 2024
Multimodal Graph Learning for Generative Tasks
Minji Yoon
Jing Yu Koh
Bryan Hooi
Ruslan Salakhutdinov
204
23
0
11 Oct 2023
MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation
IEEE International Conference on Computer Vision (ICCV), 2023
Najmeh Sadoughi
Xinyu Li
Avijit Vajpayee
D. Fan
Bing Shuai
H. Santos-Villalobos
Vimal Bhat
M. Rohith
298
6
0
22 Aug 2023
Read, Look or Listen? What's Needed for Solving a Multimodal Dataset
Netta Madvil
Yonatan Bitton
Roy Schwartz
282
3
0
06 Jul 2023
Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Emanuele Bugliarello
Aida Nematzadeh
Lisa Anne Hendricks
SSL
345
6
0
23 May 2023
Brain encoding models based on multimodal transformers can transfer across language and vision
Neural Information Processing Systems (NeurIPS), 2023
Jerry Tang
Meng Du
Vy A. Vo
Vasudev Lal
Alexander G. Huth
283
59
0
20 May 2023
Measuring Progress in Fine-grained Vision-and-Language Understanding
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Emanuele Bugliarello
Laurent Sartran
Aishwarya Agrawal
Lisa Anne Hendricks
Aida Nematzadeh
VLM
267
31
0
12 May 2023
Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts
Zhaoyang Zhang
Yantao Shen
Kunyu Shi
Zhaowei Cai
Jun Fang
Siqi Deng
Hao Yang
Davide Modolo
Zhuowen Tu
Stefano Soatto
VLM
349
3
0
11 May 2023
Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime
Chuhan Zhang
Antoine Miech
Jiajun Shen
Jean-Baptiste Alayrac
Pauline Luc
VLM, VPVLM
269
2
0
03 May 2023
In-Context Learning Unlocked for Diffusion Models
Neural Information Processing Systems (NeurIPS), 2023
Zhendong Wang
Lezhi Li
Yadong Lu
Yelong Shen
Pengcheng He
Weizhu Chen
Zinan Lin
Mingyuan Zhou
VLM, DiffM
439
108
0
01 May 2023
Probing Conceptual Understanding of Large Visual-Language Models
Madeline Chantry Schiappa
Raiyaan Abdullah
Shehreen Azad
Jared Claypoole
Michael Cogswell
Ajay Divakaran
Yogesh S Rawat
410
24
0
07 Apr 2023
Self-Supervised Multimodal Learning: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
450
109
0
31 Mar 2023
Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval
Computer Vision and Pattern Recognition (CVPR), 2023
Ding Jiang
Mang Ye
302
305
0
22 Mar 2023
Transformers in Speech Processing: A Survey
S. Latif
Aun Zaidi
Heriberto Cuayáhuitl
Fahad Shamshad
Moazzam Shoukat
Muhammad Usama
Junaid Qadir
515
76
0
21 Mar 2023
A Simple Framework for Open-Vocabulary Segmentation and Detection
IEEE International Conference on Computer Vision (ICCV), 2023
Hao Zhang
Feng Li
Xueyan Zou
Siyi Liu
Chun-yue Li
Jianfeng Gao
Jianwei Yang
Lei Zhang
ObjD, VLM
653
238
0
14 Mar 2023
Refined Vision-Language Modeling for Fine-grained Multi-modal Pre-training
Lisai Zhang
Qingcai Chen
Zhijian Chen
Yunpeng Han
Zhonghua Li
Bo Zhao
VLM
227
1
0
09 Mar 2023
SwinCross: Cross-modal Swin Transformer for Head-and-Neck Tumor Segmentation in PET/CT Images
Medical Physics (Lancaster) (Med. Phys.), 2023
Gary Y. Li
Junyu Chen
Se-In Jang
Kuang Gong
Shijie Zhao
ViT, MedIm
241
25
0
08 Feb 2023
ClimaX: A foundation model for weather and climate
International Conference on Machine Learning (ICML), 2023
Tung Nguyen
Johannes Brandstetter
Ashish Kapoor
Jayesh K. Gupta
Aditya Grover
AI4Cl, AI4CE
713
415
0
24 Jan 2023
Generalized Decoding for Pixel, Image, and Language
Computer Vision and Pattern Recognition (CVPR), 2022
Xueyan Zou
Zi-Yi Dou
Jianwei Yang
Zhe Gan
Linjie Li
...
Lu Yuan
Nanyun Peng
Lijuan Wang
Yong Jae Lee
Jianfeng Gao
VLM, MLLM, ObjD
394
355
0
21 Dec 2022
Compound Tokens: Channel Fusion for Vision-Language Representation Learning
Maxwell Mbabilla Aladago
A. Piergiovanni
224
3
0
02 Dec 2022
Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions
Michele Cafagna
Kees van Deemter
Albert Gatt
CoGe
200
4
0
09 Nov 2022
Late Fusion with Triplet Margin Objective for Multimodal Ideology Prediction and Analysis
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Changyuan Qiu
Winston Wu
Xinliang Frederick Zhang
Lu Wang
192
1
0
04 Nov 2022
Training Vision-Language Models with Less Bimodal Supervision
Conference on Automated Knowledge Base Construction (AKBC), 2022
Elad Segal
Ben Bogin
Jonathan Berant
VLM
153
2
0
01 Nov 2022
Multimodal Transformer for Parallel Concatenated Variational Autoencoders
Stephen D. Liang
J. Mendel
ViT
309
6
0
28 Oct 2022
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Workshop on Representation Learning for NLP (RepL4NLP), 2022
Gregor Geigle
Chen Cecilia Liu
Jonas Pfeiffer
Iryna Gurevych
VLM
236
1
0
12 Oct 2022
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
ACM Computing Surveys (ACM CSUR), 2022
Paul Pu Liang
Amir Zadeh
Louis-Philippe Morency
363
218
0
07 Sep 2022
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment
British Machine Vision Conference (BMVC), 2022
Mustafa Shukor
Guillaume Couairon
Matthieu Cord
VLM, CLIP
394
28
0
29 Aug 2022
Contrastive Audio-Language Learning for Music
International Society for Music Information Retrieval Conference (ISMIR), 2022
Ilaria Manco
Emmanouil Benetos
Elio Quinton
Gyorgy Fazekas
409
64
0
25 Aug 2022
AutoTransition: Learning to Recommend Video Transition Effects
European Conference on Computer Vision (ECCV), 2022
Yaojie Shen
Libo Zhang
Kai Xu
Xiaojie Jin
VGen
211
14
0
27 Jul 2022
Vision-and-Language Pretraining
Thong Nguyen
Cong-Duy Nguyen
Xiaobao Wu
See-Kiong Ng
Anh Tuan Luu
VLM, CLIP
322
2
0
05 Jul 2022
VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations
Tiancheng Zhao
Tianqi Zhang
Mingwei Zhu
Haozhan Shen
Kyusong Lee
Xiaopeng Lu
Jianwei Yin
VLM, CoGe, MLLM
391
121
0
01 Jul 2022
BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning
AAAI Conference on Artificial Intelligence (AAAI), 2022
Xiao Xu
Chenfei Wu
Shachar Rosenman
Vasudev Lal
Wanxiang Che
Nan Duan
332
96
0
17 Jun 2022
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Neural Information Processing Systems (NeurIPS), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
569
285
0
16 Jun 2022
Evaluating Self-Supervised Learning for Molecular Graph Embeddings
Neural Information Processing Systems (NeurIPS), 2022
Hanchen Wang
Jean Kaddour
Shengchao Liu
Jian Tang
Joan Lasenby
Qi Liu
406
34
0
16 Jun 2022
Multimodal Learning with Transformers: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Peng Xu
Xiatian Zhu
David Clifton
ViT
664
967
0
13 Jun 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Neural Information Processing Systems (NeurIPS), 2022
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM, VLM
869
5,564
0
29 Apr 2022
High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning
Paul Pu Liang
Yiwei Lyu
Xiang Fan
Jeffrey Tsaw
Yudong Liu
Shentong Mo
Dani Yogatama
Louis-Philippe Morency
Ruslan Salakhutdinov
316
47
0
02 Mar 2022
Distilled Dual-Encoder Model for Vision-Language Understanding
Zekun Wang
Wenhui Wang
Haichao Zhu
Ming Liu
Bing Qin
Furu Wei
VLM, FedML
248
36
0
16 Dec 2021
MLP Architectures for Vision-and-Language Modeling: An Empirical Study
Yi-Liang Nie
Linjie Li
Zhe Gan
Shuohang Wang
Chenguang Zhu
Michael Zeng
Zicheng Liu
Joey Tianyi Zhou
Lijuan Wang
189
10
0
08 Dec 2021