2104.12763
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
IEEE International Conference on Computer Vision (ICCV), 2021
26 April 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD
VLM
Github (1008★)
Papers citing
"MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"
50 / 671 papers shown
Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
Conference on Robot Learning (CoRL), 2022
Mohit Shridhar
Lucas Manuelli
Dieter Fox
LM&Ro
511
651
0
12 Sep 2022
OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network
IET Computer Vision (ICV), 2022
Tiancheng Zhao
Peng Liu
Kyusong Lee
VLM
MLLM
ObjD
114
13
0
10 Sep 2022
RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection
Neural Information Processing Systems (NeurIPS), 2022
Hangjie Yuan
Jianwen Jiang
Samuel Albanie
Tao Feng
Ziyuan Huang
Dong Ni
Mingqian Tang
VLM
275
73
0
05 Sep 2022
Injecting Image Details into CLIP's Feature Space
Zilun Zhang
Cuifeng Shen
Yuan-Chung Shen
Huixin Xiong
Xinyu Zhou
VLM
CLIP
167
0
0
31 Aug 2022
Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors
Computer Vision and Pattern Recognition (CVPR), 2022
Gongjie Zhang
Zhipeng Luo
Zichen Tian
Yingchen Yu
Jingyi Zhang
Shijian Lu
228
40
0
24 Aug 2022
Open-Vocabulary Universal Image Segmentation with MaskCLIP
International Conference on Machine Learning (ICML), 2022
Zheng Ding
Jieke Wang
Zhuowen Tu
CLIP
ISeg
VLM
228
124
0
18 Aug 2022
What Artificial Neural Networks Can Tell Us About Human Language Acquisition
Alex Warstadt
Samuel R. Bowman
209
133
0
17 Aug 2022
PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding
ACM Multimedia (ACM MM), 2022
Zihan Ding
Zixiang Ding
Tianrui Hui
Junshi Huang
Xiaoming Wei
Xiaolin K. Wei
Si Liu
162
15
0
11 Aug 2022
Masked Vision and Language Modeling for Multi-modal Representation Learning
International Conference on Learning Representations (ICLR), 2022
Gukyeong Kwon
Zhaowei Cai
Avinash Ravichandran
Erhan Bas
Rahul Bhotika
Stefano Soatto
187
82
0
03 Aug 2022
One for All: One-stage Referring Expression Comprehension with Dynamic Reasoning
Neurocomputing (Neurocomputing), 2022
Zhipeng Zhang
Zhimin Wei
Zhongzhen Huang
Rui Niu
Peng Wang
ObjD
LRM
212
9
0
31 Jul 2022
Fine-grained Retrieval Prompt Tuning
AAAI Conference on Artificial Intelligence (AAAI), 2022
Shijie Wang
Jianlong Chang
Zhihui Wang
Haojie Li
Wanli Ouyang
Qi Tian
VLM
VPVLM
134
23
0
29 Jul 2022
Pro-tuning: Unified Prompt Tuning for Vision Tasks
Xing Nie
Bolin Ni
Jianlong Chang
Gaomeng Meng
Chunlei Huo
Zhaoxiang Zhang
Shiming Xiang
Qi Tian
Chunhong Pan
AAML
VPVLM
VLM
330
95
0
28 Jul 2022
SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding
European Conference on Computer Vision (ECCV), 2022
Mengxue Qu
Yu Wu
Wu Liu
Qiqi Gong
Xiaodan Liang
Olga Russakovsky
Yao Zhao
Yunchao Wei
ObjD
105
26
0
27 Jul 2022
DETRs with Hybrid Matching
Computer Vision and Pattern Recognition (CVPR), 2022
Ding Jia
Yuhui Yuan
Hao He
Xiao-pei Wu
Haojun Yu
Weihong Lin
Lei-huan Sun
Chao Zhang
Hanhua Hu
328
247
0
26 Jul 2022
Multi-Attention Network for Compressed Video Referring Object Segmentation
ACM Multimedia (ACM MM), 2022
Weidong Chen
Dexiang Hong
Yuankai Qi
Zhenjun Han
Shuhui Wang
Laiyun Qing
Qingming Huang
Guorong Li
VOS
120
51
0
26 Jul 2022
Correspondence Matters for Video Referring Expression Comprehension
ACM Multimedia (ACM MM), 2022
Meng Cao
Ji Jiang
Long Chen
Yuexian Zou
VOS
249
21
0
21 Jul 2022
Exploiting Unlabeled Data with Vision and Language Models for Object Detection
European Conference on Computer Vision (ECCV), 2022
Shiyu Zhao
Zhixing Zhang
S. Schulter
Long Zhao
Vijay Kumar B.G
Anastasis Stathopoulos
Manmohan Chandraker
Dimitris N. Metaxas
VLM
ObjD
155
121
0
18 Jul 2022
3D Concept Grounding on Neural Fields
Neural Information Processing Systems (NeurIPS), 2022
Yining Hong
Yilun Du
Chun-Tse Lin
J. Tenenbaum
Chuang Gan
179
23
0
13 Jul 2022
Inner Monologue: Embodied Reasoning through Planning with Language Models
Conference on Robot Learning (CoRL), 2022
Wenlong Huang
F. Xia
Ted Xiao
Harris Chan
Jacky Liang
...
Tomas Jackson
Linda Luu
Sergey Levine
Karol Hausman
Brian Ichter
LLMAG
LM&Ro
LRM
352
1,135
0
12 Jul 2022
Video Graph Transformer for Video Question Answering
European Conference on Computer Vision (ECCV), 2022
Junbin Xiao
Pan Zhou
Tat-Seng Chua
Shuicheng Yan
ViT
451
94
0
12 Jul 2022
Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection
Neural Information Processing Systems (NeurIPS), 2022
H. Rasheed
Muhammad Maaz
Muhammad Uzair Khattak
Salman Khan
Fahad Shahbaz Khan
ObjD
VLM
297
181
0
07 Jul 2022
STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding
Zihang Lin
Chaolei Tan
Jianfang Hu
Zhi Jin
Tiancai Ye
Weihao Zheng
181
4
0
06 Jul 2022
Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases
Zhihao Yuan
Xu Yan
Zhuo Li
Xuhao Li
Yao Guo
Shuguang Cui
Zhen Li
144
18
0
05 Jul 2022
VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations
Tiancheng Zhao
Tianqi Zhang
Mingwei Zhu
Haozhan Shen
Kyusong Lee
Xiaopeng Lu
Jianwei Yin
VLM
CoGe
MLLM
258
110
0
01 Jul 2022
EBMs vs. CL: Exploring Self-Supervised Visual Pretraining for Visual Question Answering
Violetta Shevchenko
Ehsan Abbasnejad
A. Dick
Anton Van Den Hengel
Damien Teney
187
0
0
29 Jun 2022
DALL-E for Detection: Language-driven Compositional Image Synthesis for Object Detection
Yunhao Ge
Lyne Tchapmi
Brian Nlong Zhao
Neel Joshi
Laurent Itti
Vibhav Vineet
DiffM
ObjD
290
23
0
20 Jun 2022
What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs
Neural Information Processing Systems (NeurIPS), 2022
Tal Shaharabany
Yoad Tewel
Lior Wolf
ObjD
214
22
0
19 Jun 2022
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
International Conference on Learning Representations (ICLR), 2022
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjD
VLM
MLLM
345
470
0
17 Jun 2022
BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning
AAAI Conference on Artificial Intelligence (AAAI), 2022
Xiao Xu
Chenfei Wu
Shachar Rosenman
Vasudev Lal
Wanxiang Che
Nan Duan
160
90
0
17 Jun 2022
MixGen: A New Multi-Modal Data Augmentation
Xiaoshuai Hao
Yi Zhu
Srikar Appalaraju
Aston Zhang
Wanqian Zhang
Boyang Li
Mu Li
VLM
289
120
0
16 Jun 2022
SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos
Neural Information Processing Systems (NeurIPS), 2022
Gamaleldin F. Elsayed
Aravindh Mahendran
Sjoerd van Steenkiste
Klaus Greff
Michael C. Mozer
Thomas Kipf
VOS
OCL
317
167
0
15 Jun 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Neural Information Processing Systems (NeurIPS), 2022
Zi-Yi Dou
Aishwarya Kamath
Zhe Gan
Pengchuan Zhang
Jianfeng Wang
...
Ce Liu
Yann LeCun
Nanyun Peng
Jianfeng Gao
Lijuan Wang
VLM
ObjD
218
149
0
15 Jun 2022
ReCo: Retrieve and Co-segment for Zero-shot Transfer
Neural Information Processing Systems (NeurIPS), 2022
Gyungin Shin
Weidi Xie
Samuel Albanie
VLM
315
119
0
14 Jun 2022
TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Jiajun Deng
Zhengyuan Yang
Daqing Liu
Tianlang Chen
Wen-gang Zhou
Yanyong Zhang
Houqiang Li
Wanli Ouyang
ViT
200
86
0
14 Jun 2022
INDIGO: Intrinsic Multimodality for Domain Generalization
Puneet Mangla
Shivam Chandhok
Milan Aggarwal
V. Balasubramanian
Balaji Krishnamurthy
VLM
148
3
0
13 Jun 2022
GLIPv2: Unifying Localization and Vision-Language Understanding
Haotian Zhang
Pengchuan Zhang
Xiaowei Hu
Yen-Chun Chen
Liunian Harold Li
Xiyang Dai
Lijuan Wang
Lu Yuan
Lei Li
Jianfeng Gao
ObjD
VLM
241
352
0
12 Jun 2022
Referring Image Matting
Computer Vision and Pattern Recognition (CVPR), 2022
Jizhizi Li
Jing Zhang
Dacheng Tao
ObjD
VLM
180
31
0
10 Jun 2022
VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
Wangchunshu Zhou
Yan Zeng
Shizhe Diao
Xinsong Zhang
CoGe
VLM
260
14
0
30 May 2022
Visual Superordinate Abstraction for Robust Concept Learning
Machine Intelligence Research (MIR), 2022
Qinjie Zheng
Chaoyue Wang
Dadong Wang
Dacheng Tao
VLM
148
3
0
28 May 2022
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Chenliang Li
Haiyang Xu
Junfeng Tian
Wei Wang
Ming Yan
...
Ji Zhang
Songfang Huang
Feiran Huang
Jingren Zhou
Luo Si
VLM
MLLM
201
264
0
24 May 2022
Wireless Ad Hoc Federated Learning: A Fully Distributed Cooperative Machine Learning
H. Ochiai
Yuwei Sun
Qingzhe Jin
Nattanon Wongwiwatchai
Hiroshi Esaki
125
24
0
24 May 2022
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yuan Yao
Qi-An Chen
Ao Zhang
Wei Ji
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
VLM
MLLM
217
43
0
23 May 2022
Training Vision-Language Transformers from Captions
Liangke Gui
Yingshan Chang
Qiuyuan Huang
Subhojit Som
Alexander G. Hauptmann
Jianfeng Gao
Yonatan Bisk
VLM
ViT
342
11
0
19 May 2022
Simple Open-Vocabulary Object Detection with Vision Transformers
Matthias Minderer
A. Gritsenko
Austin Stone
Maxim Neumann
Dirk Weissenborn
...
Zhuoran Shen
Tianlin Li
Xiaohua Zhai
Thomas Kipf
N. Houlsby
ObjD
CLIP
VLM
ViT
OCL
254
362
0
12 May 2022
Weakly-supervised segmentation of referring expressions
Robin Strudel
Ivan Laptev
Cordelia Schmid
209
28
0
10 May 2022
Beyond Bounding Box: Multimodal Knowledge Learning for Object Detection
Wei Feng
Xingyuan Bu
Chenchen Zhang
Xubin Li
VLM
113
5
0
09 May 2022
Declaration-based Prompt Tuning for Visual Question Answering
International Joint Conference on Artificial Intelligence (IJCAI), 2022
Yuhang Liu
Wei Wei
Daowan Peng
Feida Zhu
MLLM
VLM
98
21
0
05 May 2022
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
A. Piergiovanni
Wei Li
Weicheng Kuo
M. Saffar
Fred Bertsch
A. Angelova
207
18
0
02 May 2022
A Multi-level Alignment Training Scheme for Video-and-Language Grounding
Yubo Zhang
Feiyang Niu
Q. Ping
Govind Thattai
CVBM
165
2
0
22 Apr 2022
Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension
IEEE Transactions on Image Processing (IEEE TIP), 2022
Peihan Miao
Wei Su
Gaoang Wang
Xuewei Li
Xi Li
ObjD
246
12
0
21 Apr 2022