2107.07651
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
16 July 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq R. Joty
Caiming Xiong
S. Hoi
FaML
Papers citing
"Align before Fuse: Vision and Language Representation Learning with Momentum Distillation"
50 / 1,191 papers shown
Title
Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval
Mustafa Shukor
Nicolas Thome
Matthieu Cord
CLIP
CoGe
19
8
0
08 Dec 2022
Graph Matching with Bi-level Noisy Correspondence
Yijie Lin
Mouxing Yang
Jun Yu
Peng Hu
Changqing Zhang
Xiaocui Peng
27
32
0
08 Dec 2022
G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks
Zhongwei Wan
Yichun Yin
Wei Zhang
Jiaxin Shi
Lifeng Shang
Guangyong Chen
Xin Jiang
Qun Liu
VLM
CLL
21
16
0
07 Dec 2022
SimVTP: Simple Video Text Pre-training with Masked Autoencoders
Yue Ma
Tianyu Yang
Yin Shan
Xiu Li
25
27
0
07 Dec 2022
PØDA: Prompt-driven Zero-shot Domain Adaptation
Mohammad Fahes
Tuan-Hung Vu
Andrei Bursuc
Patrick Pérez
Raoul de Charette
VLM
36
45
0
06 Dec 2022
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
...
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
38
309
0
06 Dec 2022
LUNA: Language Understanding with Number Augmentations on Transformers via Number Plugins and Pre-training
Hongwei Han
Jialiang Xu
Mengyuan Zhou
Yijia Shao
Shi Han
Dongmei Zhang
LMTD
19
7
0
06 Dec 2022
Controllable Image Captioning via Prompting
Ning Wang
Jiahao Xie
Jihao Wu
Mingbo Jia
Linlin Li
14
23
0
04 Dec 2022
Compound Tokens: Channel Fusion for Vision-Language Representation Learning
Maxwell Mbabilla Aladago
A. Piergiovanni
19
1
0
02 Dec 2022
Normalized Contrastive Learning for Text-Video Retrieval
Yookoon Park
Mahmoud Azab
Bo Xiong
Seungwhan Moon
Florian Metze
Gourab Kundu
Kirmani Ahmed
17
11
0
30 Nov 2022
Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles
Shuquan Ye
Yujia Xie
Dongdong Chen
Yichong Xu
Lu Yuan
Chenguang Zhu
Jing Liao
VLM
19
11
0
29 Nov 2022
SLAN: Self-Locator Aided Network for Cross-Modal Understanding
Jiang-Tian Zhai
Qi Zhang
Tong Wu
Xinghan Chen
Jiangjiang Liu
Bo Ren
Ming-Ming Cheng
ObjD
VLM
23
1
0
28 Nov 2022
Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image Models
Lei Wang
Jian He
Xingdong Xu
Ning Liu
Hui-juan Liu
27
2
0
27 Nov 2022
Exploring Consistency in Cross-Domain Transformer for Domain Adaptive Semantic Segmentation
Kaihong Wang
Donghyun Kim
Rogerio Feris
Kate Saenko
Margrit Betke
ViT
20
4
0
27 Nov 2022
CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification without Concrete Text Labels
Siyuan Li
Li Sun
Qingli Li
VLM
28
148
0
25 Nov 2022
Self-supervised vision-language pretraining for Medical visual question answering
Pengfei Li
Gang Liu
Lin Tan
Jinying Liao
Shenjun Zhong
MedIm
14
33
0
24 Nov 2022
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning
Yatai Ji
Rong-Cheng Tu
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
32
13
0
24 Nov 2022
Open-vocabulary Attribute Detection
M. A. Bravo
Sudhanshu Mittal
Simon Ging
Thomas Brox
VLM
ObjD
14
30
0
23 Nov 2022
VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval
Siteng Huang
Biao Gong
Yulin Pan
Jianwen Jiang
Yiliang Lv
Yuyuan Li
Donglin Wang
VLM
VPVLM
16
41
0
23 Nov 2022
X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
Yan Zeng
Xinsong Zhang
Hang Li
Jiawei Wang
Jipeng Zhang
Wangchunshu Zhou
VLM
MLLM
21
14
0
22 Nov 2022
Teaching Structured Vision&Language Concepts to Vision&Language Models
Sivan Doveh
Assaf Arbelle
Sivan Harary
Rameswar Panda
Roei Herzig
...
Donghyun Kim
Raja Giryes
Rogerio Feris
S. Ullman
Leonid Karlinsky
VLM
CoGe
45
70
0
21 Nov 2022
Multitask Vision-Language Prompt Tuning
Sheng Shen
Shijia Yang
Tianjun Zhang
Bohan Zhai
Joseph E. Gonzalez
Kurt Keutzer
Trevor Darrell
VLM
VPVLM
17
49
0
21 Nov 2022
Exploring Discrete Diffusion Models for Image Captioning
Zixin Zhu
Yixuan Wei
Jianfeng Wang
Zhe Gan
Zheng-Wei Zhang
Le Wang
G. Hua
Lijuan Wang
Zicheng Liu
Han Hu
DiffM
VLM
23
17
0
21 Nov 2022
ClipCrop: Conditioned Cropping Driven by Vision-Language Model
Zhihang Zhong
Mingxi Cheng
Zhirong Wu
Yuhui Yuan
Yinqiang Zheng
Ji Li
Han Hu
Stephen Lin
Yoichi Sato
Imari Sato
VLM
CLIP
25
3
0
21 Nov 2022
Cross-Modal Contrastive Learning for Robust Reasoning in VQA
Qinjie Zheng
Chaoyue Wang
Daqing Liu
Dadong Wang
Dacheng Tao
LRM
19
0
0
21 Nov 2022
Unifying Vision-Language Representation Space with Single-tower Transformer
Jiho Jang
Chaerin Kong
D. Jeon
Seonhoon Kim
Nojun Kwak
25
19
0
21 Nov 2022
Leveraging per Image-Token Consistency for Vision-Language Pre-training
Yunhao Gou
Tom Ko
Hansi Yang
James T. Kwok
Yu Zhang
Mingxuan Wang
VLM
16
9
0
20 Nov 2022
A survey on knowledge-enhanced multimodal learning
Maria Lymperaiou
Giorgos Stamou
28
13
0
19 Nov 2022
Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model
Jinho Chang
Jong Chul Ye
AI4CE
22
29
0
19 Nov 2022
CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering
Yao Zhang
Haokun Chen
A. Frikha
Yezi Yang
Denis Krompass
Gengyuan Zhang
Jindong Gu
Volker Tresp
VLM
LRM
10
7
0
19 Nov 2022
Task Residual for Tuning Vision-Language Models
Tao Yu
Zhihe Lu
Xin Jin
Zhibo Chen
Xinchao Wang
VLM
CLIP
16
81
0
18 Nov 2022
ConStruct-VL: Data-Free Continual Structured VL Concepts Learning
James Smith
Paola Cascante-Bonilla
Assaf Arbelle
Donghyun Kim
Rameswar Panda
David D. Cox
Diyi Yang
Z. Kira
Rogerio Feris
Leonid Karlinsky
VLM
33
20
0
17 Nov 2022
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision
Sophia Gu
Christopher Clark
Aniruddha Kembhavi
VLM
14
24
0
17 Nov 2022
PromptCap: Prompt-Guided Task-Aware Image Captioning
Yushi Hu
Hang Hua
Zhengyuan Yang
Weijia Shi
Noah A. Smith
Jiebo Luo
28
101
0
15 Nov 2022
Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment
Junyan Wang
Yi Zhang
Ming Yan
Ji Zhang
Jitao Sang
VLM
25
9
0
14 Nov 2022
PMR: Prototypical Modal Rebalance for Multimodal Learning
Yunfeng Fan
Wenchao Xu
Haozhao Wang
Junxiao Wang
Song Guo
23
60
0
14 Nov 2022
ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation
Bin Shan
Yaqian Han
Weichong Yin
Shuohuan Wang
Yu Sun
Hao Tian
Hua-Hong Wu
Haifeng Wang
MLLM
VLM
11
7
0
09 Nov 2022
Gradient Knowledge Distillation for Pre-trained Language Models
Lean Wang
Lei Li
Xu Sun
VLM
16
5
0
02 Nov 2022
Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection
Yanxin Long
Jianhua Han
Runhu Huang
Xu Hang
Yi Zhu
Chunjing Xu
Xiaodan Liang
VLM
ObjD
19
18
0
02 Nov 2022
Training Vision-Language Models with Less Bimodal Supervision
Elad Segal
Ben Bogin
Jonathan Berant
VLM
19
2
0
01 Nov 2022
Generative Negative Text Replay for Continual Vision-Language Pretraining
Shipeng Yan
Lanqing Hong
Hang Xu
Jianhua Han
Tinne Tuytelaars
Zhenguo Li
Xuming He
VLM
CLL
CLIP
17
17
0
31 Oct 2022
UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal Guidance
Wei Li
Xue Xu
Xinyan Xiao
Jiacheng Liu
Hu Yang
...
Zhanpeng Wang
Zhifan Feng
Qiaoqiao She
Yajuan Lyu
Hua-Hong Wu
110
29
0
28 Oct 2022
FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning
Suvir Mirchandani
Licheng Yu
Mengjiao MJ Wang
Animesh Sinha
Wen-Jun Jiang
Tao Xiang
Ning Zhang
27
16
0
26 Oct 2022
What's Different between Visual Question Answering for Machine "Understanding" Versus for Accessibility?
Yang Trista Cao
Kyle Seelman
Kyungjun Lee
Hal Daumé
6
5
0
26 Oct 2022
FairCLIP: Social Bias Elimination based on Attribute Prototype Learning and Representation Neutralization
Junyan Wang
Yi Zhang
Jitao Sang
FaML
VLM
34
22
0
26 Oct 2022
Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering
Q. Si
Yuanxin Liu
Zheng Lin
Peng Fu
Weiping Wang
VLM
29
1
0
26 Oct 2022
Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision
T. Wang
Jorma T. Laaksonen
T. Langer
Heikki Arponen
Tom E. Bishop
VLM
16
6
0
24 Oct 2022
Multilingual Multimodal Learning with Machine Translated Text
Chen Qiu
Dan Oneaţă
Emanuele Bugliarello
Stella Frank
Desmond Elliott
38
13
0
24 Oct 2022
Towards Unifying Reference Expression Generation and Comprehension
Duo Zheng
Tao Kong
Ya Jing
Jiaan Wang
Xiaojie Wang
ObjD
27
6
0
24 Oct 2022
LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling
Dongsheng Chen
Chaofan Tao
Lu Hou
Lifeng Shang
Xin Jiang
Qun Liu
VLM
25
18
0
21 Oct 2022