2104.12763
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
IEEE International Conference on Computer Vision (ICCV), 2021
26 April 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD
VLM
Github (1008★)
Papers citing
"MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"
50 / 678 papers shown
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
Navid Rajabi
Jana Kosecka
VLM
275
17
0
18 Aug 2023
RLIPv2: Fast Scaling of Relational Language-Image Pre-training
IEEE International Conference on Computer Vision (ICCV), 2023
Hangjie Yuan
Shiwei Zhang
Xiang Wang
Samuel Albanie
Yining Pan
Tao Feng
Jianwen Jiang
Dong Ni
Yingya Zhang
Deli Zhao
VLM
244
62
0
18 Aug 2023
Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
IEEE International Conference on Computer Vision (ICCV), 2023
Guangyi Chen
Xiao Liu
Guangrun Wang
Kun Zhang
Philip H. S. Torr
Xiaoping Zhang
Yansong Tang
298
27
0
16 Aug 2023
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model
IEEE International Conference on Computer Vision (ICCV), 2023
Chuhan Zhang
Ankush Gupta
Andrew Zisserman
VLM
222
35
0
15 Aug 2023
Taming Self-Training for Open-Vocabulary Object Detection
Computer Vision and Pattern Recognition (CVPR), 2023
Shiyu Zhao
S. Schulter
Long Zhao
Zhixing Zhang
Vijay Kumar B.G
Yumin Suh
Manmohan Chandraker
Dimitris N. Metaxas
VLM
ObjD
364
21
0
11 Aug 2023
Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023
Ya Jing
Xuelin Zhu
Xingbin Liu
Qie Sima
Taozheng Yang
Yunhai Feng
Tao Kong
LM&Ro
210
18
0
07 Aug 2023
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
International Conference on Learning Representations (ICLR), 2023
Weiyun Wang
Min Shi
Qingyun Li
Wen Wang
Zhenhang Huang
...
Zhiguo Cao
Yushi Chen
Tong Lu
Jifeng Dai
Yu Qiao
LRM
MLLM
270
118
0
03 Aug 2023
Grounded Image Text Matching with Mismatched Relation Reasoning
IEEE International Conference on Computer Vision (ICCV), 2023
Yu Wu
Yan-Tao Wei
Haozhe Jasper Wang
Yongfei Liu
Sibei Yang
Xuming He
257
13
0
02 Aug 2023
Towards General Visual-Linguistic Face Forgery Detection
Computer Vision and Pattern Recognition (CVPR), 2023
Ke Sun
Shen Chen
Taiping Yao
Haozhe Yang
Xiaoshuai Sun
Shouhong Ding
Rongrong Ji
265
33
0
31 Jul 2023
Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks
Kousik Rajesh
Mrigank Raman
M. A. Karim
Pranit Chawla
VLM
204
2
0
31 Jul 2023
JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery
IEEE International Conference on Computer Vision (ICCV), 2023
Jiahao Li
Zongxin Yang
Xiaohan Wang
Jianxin Ma
Chang Zhou
Yi Yang
256
19
0
31 Jul 2023
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor
Corentin Dancette
Alexandre Ramé
Matthieu Cord
MoMe
MLLM
308
54
0
30 Jul 2023
Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition
Conference on Robot Learning (CoRL), 2023
Huy Ha
Peter R. Florence
Shuran Song
LM&Ro
273
208
0
26 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming-Hsuan Yang
Fahad Shahbaz Khan
VLM
430
152
0
25 Jul 2023
3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Zehan Wang
Haifeng Huang
Yang Zhao
Lin Li
Xize Cheng
Yichen Zhu
Aoxiong Yin
Zhou Zhao
3DPC
176
29
0
25 Jul 2023
Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation
Jinxian Liu
Chen Ju
Chaofan Ma
Yanfeng Wang
Yu Wang
Ya Zhang
VOS
268
37
0
25 Jul 2023
Described Object Detection: Liberating Object Detection with Flexible Expressions
Neural Information Processing Systems (NeurIPS), 2023
Chi Xie
Zhao Zhang
YiXuan Wu
Feng Zhu
Rui Zhao
Shuang Liang
ObjD
242
50
0
24 Jul 2023
Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision
Menghao Li
Chunlei Wang
W. Feng
Shuchang Lyu
Guangliang Cheng
Xiangtai Li
Binghao Liu
Qi Zhao
275
7
0
23 Jul 2023
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method
Computer Vision and Pattern Recognition (CVPR), 2023
Zhihong Chen
Ruifei Zhang
Yibing Song
Xiang Wan
Guanbin Li
175
30
0
21 Jul 2023
Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation
IEEE International Conference on Computer Vision (ICCV), 2023
Zunnan Xu
Zhihong Chen
Yong Zhang
Yibing Song
Xiang Wan
Guanbin Li
VLM
228
70
0
21 Jul 2023
Divert More Attention to Vision-Language Object Tracking
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Mingzhe Guo
Zhipeng Zhang
Li Jing
Haibin Ling
Heng Fan
VLM
261
14
0
19 Jul 2023
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding
IEEE International Conference on Computer Vision (ICCV), 2023
Zehan Wang
Haifeng Huang
Yang Zhao
Lin Li
Xize Cheng
Yichen Zhu
Aoxiong Yin
Zhou Zhao
188
29
0
18 Jul 2023
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Chaoyang Zhu
Long Chen
ObjD
VLM
510
68
0
18 Jul 2023
Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023
Yui Iioka
Y. Yoshida
Yuiga Wada
Shumpei Hatanaka
K. Sugiura
DiffM
227
7
0
17 Jul 2023
BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization
IEEE International Conference on Computer Vision (ICCV), 2023
Chaoya Jiang
Haiyang Xu
Wei Ye
Qinghao Ye
Chenliang Li
Mingshi Yan
Bin Bi
Shikun Zhang
Fei Huang
Songfang Huang
VLM
190
9
0
17 Jul 2023
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Neural Information Processing Systems (NeurIPS), 2023
Yiren Jian
Chongyang Gao
Soroush Vosoughi
VLM
MLLM
388
44
0
13 Jul 2023
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Conference on Robot Learning (CoRL), 2023
Wenlong Huang
Chen Wang
Ruohan Zhang
Yunzhu Li
Jiajun Wu
Li Fei-Fei
LM&Ro
407
750
0
12 Jul 2023
GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023
Junghyun Kim
Gi-Cheon Kang
Suhyung Choi
Suyeon Shin
Byoung-Tak Zhang
LM&Ro
211
9
0
12 Jul 2023
Prototypical Contrastive Transfer Learning for Multimodal Language Understanding
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023
Seitaro Otsuki
Shintaro Ishikawa
K. Sugiura
186
1
0
12 Jul 2023
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Shilong Zhang
Pei Sun
Shoufa Chen
Min Xiao
Wenqi Shao
Wenwei Zhang
Yu Liu
Kai-xiang Chen
Ping Luo
MLLM
VLM
913
317
0
07 Jul 2023
Vision Language Transformers: A Survey
Clayton Fields
C. Kennington
VLM
182
7
0
06 Jul 2023
Distilling Large Vision-Language Model with Out-of-Distribution Generalizability
IEEE International Conference on Computer Vision (ICCV), 2023
Xuanlin Li
Yunhao Fang
Minghua Liu
Z. Ling
Zhuowen Tu
Haoran Su
VLM
344
43
0
06 Jul 2023
Human Inspired Progressive Alignment and Comparative Learning for Grounded Word Acquisition
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Yuwei Bao
B. Lattimer
J. Chai
CLL
279
1
0
05 Jul 2023
Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners
Conference on Robot Learning (CoRL), 2023
Allen Z. Ren
Anushri Dixit
Alexandra Bodrova
Sumeet Singh
Stephen Tu
...
Jacob Varley
Zhenjia Xu
Dorsa Sadigh
Andy Zeng
Anirudha Majumdar
LM&Ro
487
307
0
04 Jul 2023
AVSegFormer: Audio-Visual Segmentation with Transformer
AAAI Conference on Artificial Intelligence (AAAI), 2023
Sheng Gao
Zhe Chen
Guo Chen
Wenhai Wang
Tong Lu
VOS
372
80
0
03 Jul 2023
CoPL: Contextual Prompt Learning for Vision-Language Understanding
AAAI Conference on Artificial Intelligence (AAAI), 2023
Koustava Goswami
Srikrishna Karanam
Prateksha Udhayanan
K. J. Joseph
Balaji Vasan Srinivasan
VLM
270
18
0
03 Jul 2023
Statler: State-Maintaining Language Models for Embodied Reasoning
IEEE International Conference on Robotics and Automation (ICRA), 2023
Takuma Yoneda
Jiading Fang
Peng Li
Huanyu Zhang
Tianchong Jiang
Shengjie Lin
Ben Picker
David Yunis
Hongyuan Mei
Matthew R. Walter
LM&Ro
269
47
0
30 Jun 2023
Look, Remember and Reason: Grounded reasoning in videos with language models
International Conference on Learning Representations (ICLR), 2023
Apratim Bhattacharyya
Sunny Panchal
Mingu Lee
Reza Pourreza
Pulkit Madan
Roland Memisevic
LRM
470
13
0
30 Jun 2023
Towards Open Vocabulary Learning: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Jianzong Wu
Xiangtai Li
Shilin Xu
Haobo Yuan
Henghui Ding
...
Jiangning Zhang
Yu Tong
Xudong Jiang
Guohao Li
Dacheng Tao
ObjD
VLM
406
218
0
28 Jun 2023
REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction
Conference on Robot Learning (CoRL), 2023
Zeyi Liu
Arpit Bahety
Shuran Song
LRM
459
189
0
27 Jun 2023
Kosmos-2: Grounding Multimodal Large Language Models to the World
International Conference on Learning Representations (ICLR), 2023
Zhiliang Peng
Wenhui Wang
Li Dong
Y. Hao
Shaohan Huang
Shuming Ma
Furu Wei
MLLM
ObjD
VLM
396
1,026
0
26 Jun 2023
Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input
European Conference on Computer Vision (ECCV), 2023
Qingpei Guo
Kaisheng Yao
Wei Chu
MLLM
103
6
0
25 Jun 2023
DesCo: Learning Object Recognition with Rich Language Descriptions
Neural Information Processing Systems (NeurIPS), 2023
Liunian Harold Li
Zi-Yi Dou
Nanyun Peng
Kai-Wei Chang
ObjD
VLM
185
28
0
24 Jun 2023
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing
IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2023
Zilun Zhang
Tiancheng Zhao
Yulong Guo
Yuxiang Cai
DiffM
VLM
1.2K
153
0
20 Jun 2023
Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding
IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2023
Zengjie Song
Zhaoxiang Zhang
169
5
0
19 Jun 2023
CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Xiwen Liang
Liang Ma
Shanshan Guo
Jianhua Han
Hang Xu
Shikui Ma
Xiaodan Liang
LM&Ro
LLMAG
367
5
0
17 Jun 2023
Scaling Open-Vocabulary Object Detection
Neural Information Processing Systems (NeurIPS), 2023
Matthias Minderer
A. Gritsenko
N. Houlsby
VLM
ObjD
418
309
0
16 Jun 2023
Recurrent Action Transformer with Memory
A. Staroverov
A. Bessonov
Dmitry A. Yudin
A. Kovalev
Aleksandr I. Panov
OffRL
393
13
0
15 Jun 2023
Exploring the Application of Large-scale Pre-trained Models on Adverse Weather Removal
IEEE Transactions on Image Processing (IEEE TIP), 2023
Zhentao Tan
Yue-bo Wu
Qiankun Liu
Qi Chu
Le Lu
Jieping Ye
Nenghai Yu
230
24
0
15 Jun 2023
World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Ziqiao Ma
Jiayi Pan
J. Chai
ObjD
VLM
196
12
0
14 Jun 2023