2107.07651
Cited By
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq R. Joty
Caiming Xiong
S. Hoi
FaML
Papers citing
"Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"
50 / 1,192 papers shown
Title
FineEHR: Refine Clinical Note Representations to Improve Mortality Prediction
Jun Wu
Xuesong Ye
Chengjie Mou
Weina Dai
54
18
0
24 Apr 2023
SATIN: A Multi-Task Metadataset for Classifying Satellite Imagery using Vision-Language Models
Jonathan Roberts
Kai Han
Samuel Albanie
VLM
27
12
0
23 Apr 2023
OmniLabel: A Challenging Benchmark for Language-Based Object Detection
S. Schulter
G. VijayKumarB.
Yumin Suh
Konstantinos M. Dafnis
Zhixing Zhang
Shiyu Zhao
Dimitris N. Metaxas
ObjD
22
11
0
22 Apr 2023
RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models
Seulki Park
Daeho Um
Hajung Yoon
Sanghyuk Chun
Sangdoo Yun
Jin Young Choi
25
2
0
21 Apr 2023
Image-text Retrieval via Preserving Main Semantics of Vision
Xu Zhang
Xinzheng Niu
Philippe Fournier-Viger
Xudong Dai
VLM
11
5
0
20 Apr 2023
SViTT: Temporal Learning of Sparse Video-Text Transformers
Yi Li
Kyle Min
Subarna Tripathi
Nuno Vasconcelos
17
12
0
18 Apr 2023
Learning Situation Hyper-Graphs for Video Question Answering
Aisha Urooj Khan
Hilde Kuehne
Bo Wu
Kim Chheu
Walid Bousselham
Chuang Gan
N. Lobo
M. Shah
34
15
0
18 Apr 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
Jing Liu
VLM
26
102
0
17 Apr 2023
Multimodal Representation Learning of Cardiovascular Magnetic Resonance Imaging
Jielin Qiu
Peide Huang
Makiya Nakashima
Jae-Hyeok Lee
Jiacheng Zhu
...
Byung-Hak Kim
Debbie Kwon
Douglas Weber
Ding Zhao
David Chen
SSL
19
4
0
16 Apr 2023
CoVLR: Coordinating Cross-Modal Consistency and Intra-Modal Structure for Vision-Language Retrieval
Yang Yang
Zhongtian Fu
Xiangyu Wu
Wenjie Li
VLM
13
1
0
15 Apr 2023
LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision
Jiani Huang
Ziyang Li
Mayur Naik
Ser-Nam Lim
35
3
0
15 Apr 2023
Automated Cardiovascular Record Retrieval by Multimodal Learning between Electrocardiogram and Clinical Report
Jielin Qiu
Jiacheng Zhu
Shiqi Liu
William Jongwon Han
Jingqi Zhang
Chaojing Duan
Michael A. Rosenberg
Emerson Liu
Douglas Weber
Ding Zhao
9
0
0
13 Apr 2023
MoMo: A shared encoder Model for text, image and multi-Modal representations
Rakesh Chada
Zhao-Heng Zheng
P. Natarajan
ViT
19
4
0
11 Apr 2023
FashionSAP: Symbols and Attributes Prompt for Fine-grained Fashion Vision-Language Pre-training
Yunpeng Han
Lisai Zhang
Qingcai Chen
Zhijian Chen
Zhonghua Li
Jianxin Yang
Zhao Cao
AI4TS
VLM
26
11
0
11 Apr 2023
Token Boosting for Robust Self-Supervised Visual Transformer Pre-training
Tianjiao Li
Lin Geng Foo
Ping Hu
Xindi Shang
Hossein Rahmani
Zehuan Yuan
J. Liu
32
7
0
09 Apr 2023
Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-commerce
Yang Jin
Yongzhi Li
Zehuan Yuan
Yadong Mu
22
13
0
06 Apr 2023
Detecting and Grounding Multi-Modal Media Manipulation
Rui Shao
Tianxing Wu
Ziwei Liu
32
57
0
05 Apr 2023
AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia Content Creation
Jheng-Hong Yang
Carlos Lassance
Rafael Sampaio de Rezende
Krishna Srinivasan
Miriam Redi
S. Clinchant
Jimmy J. Lin
37
12
0
04 Apr 2023
Black Box Few-Shot Adaptation for Vision-Language models
Yassine Ouali
Adrian Bulat
Brais Martínez
Georgios Tzimiropoulos
VLM
26
31
0
04 Apr 2023
Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement
Xiang-yu Zhu
Renrui Zhang
Bowei He
A-Long Zhou
Dong Wang
Bingyan Zhao
Peng Gao
VLM
27
79
0
03 Apr 2023
Multi-Modal Representation Learning with Text-Driven Soft Masks
Jaeyoo Park
Bohyung Han
SSL
17
4
0
03 Apr 2023
DIME-FM: DIstilling Multimodal and Efficient Foundation Models
Ximeng Sun
Pengchuan Zhang
Peizhao Zhang
Hardik Shah
Kate Saenko
Xide Xia
VLM
13
20
0
31 Mar 2023
Self-Supervised Multimodal Learning: A Survey
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
19
43
0
31 Mar 2023
Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations
Yiwu Zhong
Licheng Yu
Yang Bai
Shangwen Li
Xueting Yan
Yin Li
AI4TS
30
31
0
31 Mar 2023
SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger
Yuting Gao
Jinfeng Liu
Zi-Han Xu
Tong Wu
W. Liu
Jie-jin Yang
Keren Li
Xingen Sun
CLIP
VLM
25
42
0
30 Mar 2023
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
Xinhao Mei
Chutong Meng
Haohe Liu
Qiuqiang Kong
Tom Ko
Chengqi Zhao
Mark D. Plumbley
Yuexian Zou
Wenwu Wang
43
192
0
30 Mar 2023
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Lucas Beyer
Bo Wan
Gagan Madan
Filip Pavetić
Andreas Steiner
...
Emanuele Bugliarello
Xiao Wang
Qihang Yu
Liang-Chieh Chen
Xiaohua Zhai
49
8
0
30 Mar 2023
Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models
Sifan Long
Zhen Zhao
Junkun Yuan
Zichang Tan
Jiangjiang Liu
Luping Zhou
Sheng-sheng Wang
Jingdong Wang
VLM
23
2
0
30 Mar 2023
Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations
VS Vibashan
Ning Yu
Chen Xing
Can Qin
M. Gao
Juan Carlos Niebles
Vishal M. Patel
Ran Xu
VLM
ISeg
28
18
0
29 Mar 2023
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
Weicheng Kuo
A. Piergiovanni
Dahun Kim
Xiyang Luo
Benjamin Caine
...
Luowei Zhou
Andrew M. Dai
Zhifeng Chen
Claire Cui
A. Angelova
MLLM
VLM
23
23
0
29 Mar 2023
Multimodal Image-Text Matching Improves Retrieval-based Chest X-Ray Report Generation
Jaehwan Jeong
Katherine Tian
Andrew Li
Sina Hartung
Fardad Behzadi
Juan Calle
David E Osayande
M. Pohlen
Subathra Adithan
Pranav Rajpurkar
MedIm
17
45
0
29 Mar 2023
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Kunchang Li
Yali Wang
Yizhuo Li
Yi Wang
Yinan He
Limin Wang
Yu Qiao
VGen
32
154
0
28 Mar 2023
CoRe-Sleep: A Multimodal Fusion Framework for Time Series Robust to Imperfect Modalities
Konstantinos Kontras
Christos Chatzichristos
Huy P Phan
Johan A. K. Suykens
Marina De Vos
AI4TS
19
11
0
27 Mar 2023
Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens
Yuxiao Chen
Jianbo Yuan
Yu Tian
Shijie Geng
Xinyu Li
Ding Zhou
Dimitris N. Metaxas
Hongxia Yang
14
33
0
27 Mar 2023
WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation
Jongheon Jeong
Yang Zou
Taewan Kim
Dongqing Zhang
Avinash Ravichandran
O. Dabeer
VLM
67
184
0
26 Mar 2023
GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents
Tenglong Ao
Zeyi Zhang
Libin Liu
DiffM
VGen
67
144
0
26 Mar 2023
Equivariant Similarity for Vision-Language Foundation Models
Tan Wang
Kevin Qinghong Lin
Linjie Li
Chung-Ching Lin
Zhengyuan Yang
Hanwang Zhang
Zicheng Liu
Lijuan Wang
CoGe
38
44
0
25 Mar 2023
Accelerating Vision-Language Pretraining with Free Language Modeling
Teng Wang
Yixiao Ge
Feng Zheng
Ran Cheng
Ying Shan
Xiaohu Qie
Ping Luo
VLM
MLLM
89
9
0
24 Mar 2023
CoBIT: A Contrastive Bi-directional Image-Text Generation Model
Haoxuan You
Mandy Guo
Zhecan Wang
Kai-Wei Chang
Jason Baldridge
Jiahui Yu
DiffM
37
12
0
23 Mar 2023
FER-former: Multi-modal Transformer for Facial Expression Recognition
Yande Li
Mingjie Wang
Minglun Gong
Y. Lu
Li Liu
23
8
0
23 Mar 2023
CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning
Yiting Cheng
Fangyun Wei
Jianmin Bao
Dong Chen
Wenqian Zhang
SLR
24
28
0
22 Mar 2023
Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval
Ding Jiang
Mang Ye
19
140
0
22 Mar 2023
CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data
Yi Zeng
Chenhan Jiang
Jiageng Mao
Jianhua Han
Chao Ye
Qingqiu Huang
Dit-Yan Yeung
Zhen Yang
Xiaodan Liang
Hang Xu
3DPC
VLM
CLIP
14
68
0
22 Mar 2023
Frozen Language Model Helps ECG Zero-Shot Learning
Jun Yu Li
Che Liu
Sibo Cheng
Rossella Arcucci
Linda Qiao
18
59
0
22 Mar 2023
Efficient Feature Distillation for Zero-shot Annotation Object Detection
Zhuoming Liu
Xuefeng Hu
Ram Nevatia
VLM
ObjD
16
1
0
21 Mar 2023
Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding
Morris Alper
Michael Fiman
Hadar Averbuch-Elor
VLM
LRM
18
16
0
21 Mar 2023
Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning
Zaid Khan
Yun Fu
VLM
28
12
0
21 Mar 2023
eP-ALM: Efficient Perceptual Augmentation of Language Models
Mustafa Shukor
Corentin Dancette
Matthieu Cord
MLLM
VLM
24
29
0
20 Mar 2023
CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition
Deepti Hegde
Jeya Maria Jose Valanarasu
Vishal M. Patel
CLIP
32
65
0
20 Mar 2023
Retrieving Multimodal Information for Augmented Generation: A Survey
Ruochen Zhao
Hailin Chen
Weishi Wang
Fangkai Jiao
Do Xuan Long
...
Bosheng Ding
Xiaobao Guo
Minzhi Li
Xingxuan Li
Shafiq R. Joty
18
80
0
20 Mar 2023