2107.07651
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq R. Joty
Caiming Xiong
S. Hoi
FaML
Papers citing
"Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"
50 / 1,193 papers shown
Title
Masked Path Modeling for Vision-and-Language Navigation
Zi-Yi Dou
Feng Gao
Nanyun Peng
LM&Ro
26
3
0
23 May 2023
Training Transitive and Commutative Multimodal Transformers with LoReTTa
Manuel Tran
Yashin Dicente Cid
Amal Lahiani
Fabian J. Theis
Tingying Peng
Eldad Klaiman
18
2
0
23 May 2023
DetGPT: Detect What You Need via Reasoning
Renjie Pi
Jiahui Gao
Shizhe Diao
Rui Pan
Hanze Dong
...
Lewei Yao
Jianhua Han
Hang Xu
Lingpeng Kong
Tong Zhang
LRM
LM&Ro
22
92
0
23 May 2023
Can Language Models Understand Physical Concepts?
Lei Li
Jingjing Xu
Qingxiu Dong
Ce Zheng
Qi Liu
Lingpeng Kong
Xu Sun
ALM
25
18
0
23 May 2023
CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model
Shuai Zhao
Xiaohan Wang
Linchao Zhu
Yezhou Yang
CLIP
VLM
19
25
0
23 May 2023
UNIMO-3: Multi-granularity Interaction for Vision-Language Representation Learning
Hao-Yu Yang
Can Gao
Hao Liu
Xinyan Xiao
Yanyan Zhao
Bing Qin
23
2
0
23 May 2023
RaSa: Relation and Sensitivity Aware Representation Learning for Text-based Person Search
Yang Bai
Ming-Ming Cao
Daming Gao
Ziqiang Cao
Cheng Chen
Zhenfeng Fan
Liqiang Nie
Min Zhang
AI4TS
75
53
0
23 May 2023
EDIS: Entity-Driven Image Search over Multimodal Web Content
Siqi Liu
Weixi Feng
Tsu-jui Fu
Wenhu Chen
W. Wang
VLM
48
9
0
23 May 2023
VideoLLM: Modeling Video Sequence with Large Language Models
Guo Chen
Yin-Dong Zheng
Jiahao Wang
Jilan Xu
Yifei Huang
...
Yi Wang
Yali Wang
Yu Qiao
Tong Lu
Limin Wang
MLLM
92
76
0
22 May 2023
Text-based Person Search without Parallel Image-Text Data
Yang Bai
Jingyao Wang
Min Cao
Cheng Chen
Ziqiang Cao
Liqiang Nie
Min Zhang
25
13
0
22 May 2023
Album Storytelling with Iterative Story-aware Captioning and Large Language Models
Munan Ning
Yujia Xie
Dongdong Chen
Zeyin Song
Lu Yuan
Yonghong Tian
QiXiang Ye
Liuliang Yuan
19
8
0
22 May 2023
Not All Semantics are Created Equal: Contrastive Self-supervised Learning with Automatic Temperature Individualization
Zimeng Qiu
Quanqi Hu
Zhuoning Yuan
Denny Zhou
Lijun Zhang
Tianbao Yang
32
17
0
19 May 2023
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
Zikang Liu
Sihan Chen
Longteng Guo
Handong Li
Xingjian He
J. Liu
13
1
0
19 May 2023
AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning
Runqi Wang
Xiaoyue Duan
Guoliang Kang
Jianzhuang Liu
Shaohui Lin
Songcen Xu
Jinhu Lv
Baochang Zhang
CLL
VLM
10
29
0
19 May 2023
Going Denser with Open-Vocabulary Part Segmentation
Pei Sun
Shoufa Chen
Chenchen Zhu
Fanyi Xiao
Ping Luo
Saining Xie
Zhicheng Yan
ObjD
VLM
20
45
0
18 May 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang
Shijie Wang
Junyang Lin
Shuai Bai
Xiaohuan Zhou
Jingren Zhou
Xinggang Wang
Chang Zhou
VLM
MLLM
ObjD
26
114
0
18 May 2023
ProgSG: Cross-Modality Representation Learning for Programs in Electronic Design Automation
Yunsheng Bai
Atefeh Sohrabizadeh
Zongyue Qin
Ziniu Hu
Yizhou Sun
Jason Cong
18
1
0
18 May 2023
When Search Meets Recommendation: Learning Disentangled Search Representation for Recommendation
Zihua Si
ZhongXiang Sun
Xiao Zhang
Jun Xu
Xiaoxue Zang
Yang Song
Kun Gai
Jirong Wen
AI4TS
21
20
0
18 May 2023
MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts
Qiuhui Chen
Xinyue Hu
Zirui Wang
Yi Hong
LM&MA
MedIm
14
34
0
18 May 2023
Segment Any Anomaly without Training via Hybrid Prompt Regularization
Yunkang Cao
Xiaohao Xu
Chen Sun
Y. Cheng
Zongwei Du
Liang Gao
Weiming Shen
VLM
31
70
0
18 May 2023
Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding
Zhang Tao
Su He
D. Tao
Bin Chen
Zhi Wang
Shutao Xia
VLM
27
21
0
18 May 2023
Paxion: Patching Action Knowledge in Video-Language Foundation Models
Zhenhailong Wang
Ansel Blume
Sha Li
Genglin Liu
Jaemin Cho
Zineng Tang
Mohit Bansal
Heng Ji
KELM
VGen
17
26
0
18 May 2023
IMAD: IMage-Augmented multi-modal Dialogue
Viktor Moskvoretskii
Anton Frolov
Denis Kuznetsov
22
4
0
17 May 2023
Evaluating Object Hallucination in Large Vision-Language Models
Yifan Li
Yifan Du
Kun Zhou
Jinpeng Wang
Wayne Xin Zhao
Ji-Rong Wen
MLLM
LRM
88
691
0
17 May 2023
TG-VQA: Ternary Game of Video Question Answering
Hao Li
Peng Jin
Ze-Long Cheng
Songyang Zhang
Kai-xiang Chen
Zhennan Wang
Chang-rui Liu
Jie Chen
26
10
0
17 May 2023
UniS-MMC: Multimodal Classification via Unimodality-supervised Multimodal Contrastive Learning
Heqing Zou
Meng Shen
Chen Chen
Yuchen Hu
D. Rajan
Chng Eng Siong
SSL
32
15
0
16 May 2023
Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition
Yuchen Hu
Ruizhe Li
Chen Chen
Heqing Zou
Qiu-shi Zhu
E. Chng
26
7
0
16 May 2023
Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models
Zhimin Chen
Longlong Jing
Yingwei Li
Bing Li
24
31
0
15 May 2023
CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding
Linhui Xiao
Xiaoshan Yang
Fang Peng
Ming Yan
Yaowei Wang
Changsheng Xu
ObjD
VLM
29
30
0
15 May 2023
Parameter-efficient Tuning of Large-scale Multimodal Foundation Model
Haixin Wang
Xinlong Yang
Jianlong Chang
Di Jin
Jinan Sun
Shikun Zhang
Xiao Luo
Qi Tian
25
23
0
15 May 2023
ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding
Le Xue
Ning Yu
Shu Zhen Zhang
Artemis Panagopoulou
Junnan Li
...
Jiajun Wu
Caiming Xiong
Ran Xu
Juan Carlos Niebles
Silvio Savarese
19
115
0
14 May 2023
CodeT5+: Open Code Large Language Models for Code Understanding and Generation
Yue Wang
Hung Le
Akhilesh Deepak Gotmare
Nghi D. Q. Bui
Junnan Li
Steven C. H. Hoi
ALM
19
460
0
13 May 2023
Measuring Progress in Fine-grained Vision-and-Language Understanding
Emanuele Bugliarello
Laurent Sartran
Aishwarya Agrawal
Lisa Anne Hendricks
Aida Nematzadeh
VLM
28
22
0
12 May 2023
IMAGINATOR: Pre-Trained Image+Text Joint Embeddings using Word-Level Grounding of Images
Varuna Krishna
S. Suryavardan
Shreyash Mishra
Sathyanarayanan Ramamoorthy
Parth Patwa
Megha Chakraborty
Aman Chadha
Amitava Das
Amit P. Sheth
VLM
17
3
0
12 May 2023
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai
Junnan Li
Dongxu Li
A. M. H. Tiong
Junqi Zhao
Weisheng Wang
Boyang Albert Li
Pascale Fung
Steven C. H. Hoi
MLLM
VLM
17
1,898
0
11 May 2023
Multi-Prompt with Depth Partitioned Cross-Modal Learning
Yingjie Tian
Yiqi Wang
Xianda Guo
Zheng Hua Zhu
Long Chen
VLM
18
0
0
10 May 2023
Vision-Language Models in Remote Sensing: Current Progress and Future Trends
Xiang Li
Congcong Wen
Yuan Hu
Zhenghang Yuan
Xiao Xiang Zhu
VLM
16
71
0
09 May 2023
Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation
Chaoya Jiang
Wei Ye
Haiyang Xu
Miang Yan
Shikun Zhang
Jie Zhang
Fei Huang
VLM
21
15
0
08 May 2023
Cross-Modal Retrieval for Motion and Text via DopTriple Loss
Sheng Yan
Yang Liu
Haoqiang Wang
Xin Du
Mengyuan Liu
Hong Liu
22
8
0
07 May 2023
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations
Yufen Huang
Jiji Tang
Zhuo Chen
Rongsheng Zhang
Xinfeng Zhang
...
Zeng Zhao
Zhou Zhao
Tangjie Lv
Zhipeng Hu
Wen Zhang
VLM
15
21
0
06 May 2023
COLA: A Benchmark for Compositional Text-to-image Retrieval
Arijit Ray
Filip Radenovic
Abhimanyu Dubey
Bryan A. Plummer
Ranjay Krishna
Kate Saenko
CoGe
VLM
38
34
0
05 May 2023
Retrieval Augmented Chest X-Ray Report Generation using OpenAI GPT models
M. Ranjit
G. Ganapathy
R. Manuel
T. Ganu
MedIm
LM&MA
15
26
0
05 May 2023
Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime
Chuhan Zhang
Antoine Miech
Jiajun Shen
Jean-Baptiste Alayrac
Pauline Luc
VLM
VPVLM
39
2
0
03 May 2023
A Neural Divide-and-Conquer Reasoning Framework for Image Retrieval from Linguistically Complex Text
Yunxin Li
Baotian Hu
Yuxin Ding
Lin Ma
M. Zhang
23
5
0
03 May 2023
Transforming Visual Scene Graphs to Image Captions
Xu Yang
Jiawei Peng
Zihua Wang
Haiyang Xu
Qinghao Ye
Chenliang Li
Mingshi Yan
Feisi Huang
Zhangzikang Li
Yu Zhang
39
19
0
03 May 2023
VPGTrans: Transfer Visual Prompt Generator across LLMs
Ao Zhang
Hao Fei
Yuan Yao
Wei Ji
Li Li
Zhiyuan Liu
Tat-Seng Chua
MLLM
VLM
27
85
0
02 May 2023
Click-Feedback Retrieval
Zeyu Wang
Yuehua Wu
24
0
0
28 Apr 2023
An Empirical Study of Multimodal Model Merging
Yi-Lin Sung
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Mohit Bansal
Lijuan Wang
MoMe
15
40
0
28 Apr 2023
VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias
Stefanos-Iordanis Papadopoulos
C. Koutlis
Symeon Papadopoulos
P. Petrantonakis
72
19
0
27 Apr 2023
Retrieval-based Knowledge Augmented Vision Language Pre-training
Jiahua Rao
Zifei Shan
Long Liu
Yao Zhou
Yuedong Yang
VLM
80
13
0
27 Apr 2023