2104.12763
Cited By
v1
v2 (latest)
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
IEEE International Conference on Computer Vision (ICCV), 2021
26 April 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD
VLM
Github (1008★)
Papers citing
"MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"
50 / 671 papers shown
Title
Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers
Georgios Pantazopoulos
Alessandro Suglia
Oliver Lemon
Arash Eshghi
VLM
167
8
0
21 Apr 2024
HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding
Linhui Xiao
Xiaoshan Yang
Fang Peng
Yaowei Wang
Changsheng Xu
ObjD
275
30
0
20 Apr 2024
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Chuofan Ma
Yi Jiang
Jiannan Wu
Zehuan Yuan
Xiaojuan Qi
VLM
ObjD
177
98
0
19 Apr 2024
Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models
Konstantinos Vilouras
Pedro Sanchez
Alison Q. O'Neil
Sotirios A. Tsaftaris
MedIm
473
9
0
19 Apr 2024
MLS-Track: Multilevel Semantic Interaction in RMOT
Zeliang Ma
Yang Song
Zhe Cui
Zhicheng Zhao
Fei Su
Delong Liu
Jingyu Wang
183
7
0
18 Apr 2024
HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision
Siddhant Bansal
Michael Wray
Dima Damen
162
10
0
15 Apr 2024
DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
Lewei Yao
Renjie Pi
Jianhua Han
Xiaodan Liang
Hang Xu
Wei Zhang
Zhenguo Li
Dan Xu
VLM
ObjD
180
42
0
14 Apr 2024
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts
Övgü Özdemir
Erdem Akagündüz
240
18
0
12 Apr 2024
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Haotian Zhang
Haoxuan You
Philipp Dufter
Bowen Zhang
Chen Chen
...
Tsu-Jui Fu
William Y. Wang
Shih-Fu Chang
Zhe Gan
Yinfei Yang
ObjD
MLLM
225
80
0
11 Apr 2024
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
Kanchana Ranasinghe
Satya Narayan Shukla
Omid Poursaeed
Michael S. Ryoo
Tsung-Yu Lin
LRM
157
55
0
11 Apr 2024
Hyperbolic Learning with Synthetic Captions for Open-World Detection
Fanjie Kong
Yanbei Chen
Jiarui Cai
Davide Modolo
VLM
ObjD
170
12
0
07 Apr 2024
3DStyleGLIP: Part-Tailored Text-Guided 3D Neural Stylization
Seung-bum Chung
Joohyun Park
Hyewon Kan
Hyeongyeop Kang
CLIP
179
6
0
03 Apr 2024
Text-driven Affordance Learning from Egocentric Vision
Tomoya Yoshida
Shuhei Kurita
Taichi Nishimura
Shinsuke Mori
215
6
0
03 Apr 2024
LocCa: Visual Pretraining with Location-aware Captioners
Bo Wan
Michael Tschannen
Yongqin Xian
Filip Pavetić
Ibrahim Alabdulmohsin
Xiao Wang
André Susano Pinto
Andreas Steiner
Lucas Beyer
Xiao-Qi Zhai
VLM
300
19
0
28 Mar 2024
J-CRe3: A Japanese Conversation Dataset for Real-world Reference Resolution
Nobuhiro Ueda
Hideko Habe
Yoko Matsui
Akishige Yuguchi
Seiya Kawano
Yasutomo Kawanishi
Sadao Kurohashi
Koichiro Yoshino
126
7
0
28 Mar 2024
Online Embedding Multi-Scale CLIP Features into 3D Maps
Shun Taguchi
Hideki Deguchi
117
0
0
27 Mar 2024
ReMamber: Referring Image Segmentation with Mamba Twister
Yu-Hao Yang
Chaofan Ma
Jiangchao Yao
Zhun Zhong
Ya Zhang
Yanfeng Wang
Mamba
214
47
0
26 Mar 2024
OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation
Ganlong Zhao
Guanbin Li
Weikai Chen
Yizhou Yu
212
13
0
26 Mar 2024
Data-Efficient 3D Visual Grounding via Order-Aware Referring
Tung-Yu Wu
Sheng-Yu Huang
Yu-Chiang Frank Wang
436
3
0
25 Mar 2024
T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
Qing Jiang
Feng Li
Zhaoyang Zeng
Tianhe Ren
Shilong Liu
Lei Zhang
VLM
243
77
0
21 Mar 2024
IVAC-P2L: Leveraging Irregular Repetition Priors for Improving Video Action Counting
Hang Wang
Zhi-Qi Cheng
Youtian Du
Lei Zhang
176
2
0
18 Mar 2024
Generative Region-Language Pretraining for Open-Ended Object Detection
Computer Vision and Pattern Recognition (CVPR), 2024
Chuang Lin
Yi Jiang
Zhuang Li
Zehuan Yuan
Jianfei Cai
ObjD
VLM
178
27
0
15 Mar 2024
GiT: Towards Generalist Vision Transformer through Universal Language Interface
European Conference on Computer Vision (ECCV), 2024
Haiyang Wang
Hao Tang
Li Jiang
Shaoshuai Shi
Muhammad Ferjad Naeem
Jiaming Song
Bernt Schiele
Liwei Wang
VLM
238
21
0
14 Mar 2024
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan
Yousong Zhu
Hongyin Zhao
Fan Yang
Fan Yang
Jinqiao Wang
Jinqiao Wang
ObjD
233
25
0
14 Mar 2024
TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection
Hanning Chen
Wenjun Huang
Yang Ni
Sanggeon Yun
Fei Wen
Hugo Latapie
Mohsen Imani
ObjD
MLLM
VLM
178
26
0
12 Mar 2024
TutoAI: A Cross-domain Framework for AI-assisted Mixed-media Tutorial Creation on Physical Tasks
International Conference on Human Factors in Computing Systems (CHI), 2024
Yuexi Chen
Vlad I. Morariu
Anh Truong
Zhicheng Liu
DiffM
VGen
202
9
0
12 Mar 2024
Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models
Neural Information Processing Systems (NeurIPS), 2024
Yang Jiao
Shaoxiang Chen
Zequn Jie
Wenke Huang
Lin Ma
Yueping Jiang
MLLM
241
23
0
12 Mar 2024
Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head
Tiancheng Zhao
Peng Liu
Xuan He
Lu Zhang
Kyusong Lee
ObjD
125
18
0
11 Mar 2024
Discriminative Probing and Tuning for Text-to-Image Generation
Leigang Qu
Wenjie Wang
Chak Tou Leong
Hanwang Zhang
Liqiang Nie
Tat-Seng Chua
265
12
0
07 Mar 2024
Detecting Concrete Visual Tokens for Multimodal Machine Translation
Braeden Bowen
Vipin Vijayan
Scott Grigsby
Timothy Anderson
Jeremy Gwinnup
211
5
0
05 Mar 2024
Enhancing Vision-Language Pre-training with Rich Supervisions
Yuan Gao
Kunyu Shi
Pengkai Zhu
Edouard Belval
Oren Nuriel
Srikar Appalaraju
Shabnam Ghadar
Vijay Mahadevan
Zhuowen Tu
Stefano Soatto
VLM
CLIP
330
15
0
05 Mar 2024
RegionGPT: Towards Region Understanding Vision Language Model
Qiushan Guo
Shalini De Mello
Hongxu Yin
Wonmin Byeon
Ka Chun Cheung
Yizhou Yu
Ping Luo
Sifei Liu
VLM
162
65
0
04 Mar 2024
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
David Wan
Jaemin Cho
Elias Stengel-Eskin
Mohit Bansal
VLM
ObjD
251
48
0
04 Mar 2024
Non-autoregressive Sequence-to-Sequence Vision-Language Models
Kunyu Shi
Qi Dong
Luis Goncalves
Zhuowen Tu
Stefano Soatto
VLM
273
4
0
04 Mar 2024
Spurious Feature Eraser: Stabilizing Test-Time Adaptation for Vision-Language Foundation Model
Huan Ma
Yan Zhu
Changqing Zhang
Peilin Zhao
Baoyuan Wu
Long-Kai Huang
Qinghua Hu
Bing Wu
VLM
443
3
0
01 Mar 2024
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
Yichi Zhang
Ziqiao Ma
Xiaofeng Gao
Suhaila Shakiah
Qiaozi Gao
Joyce Chai
MLLM
VLM
315
72
0
26 Feb 2024
Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions
Wenxuan Wang
Yisi Zhang
Xingjian He
Yichen Yan
Zijia Zhao
Xinlong Wang
Jing Liu
LM&Ro
211
5
0
17 Feb 2024
Real-World Robot Applications of Foundation Models: A Review
Kento Kawaharazuka
T. Matsushima
Andrew Gambardella
Jiaxian Guo
Chris Paxton
Andy Zeng
OffRL
VLM
LM&Ro
229
85
0
08 Feb 2024
LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors
Sheng Jin
Xue-Qiu Jiang
Jiaxing Huang
Lewei Lu
Shijian Lu
VLM
ObjD
148
38
0
07 Feb 2024
Enhancing Embodied Object Detection through Language-Image Pre-training and Implicit Object Memory
N. H. Chapman
Feras Dayoub
Will N. Browne
Chris Lehnert
ObjD
VLM
LM&Ro
163
2
0
06 Feb 2024
Phrase Grounding-based Style Transfer for Single-Domain Generalized Object Detection
Hao Li
Wei Wang
Cong Wang
Zhigang Luo
Xinwang Liu
KenLi Li
Xiaochun Cao
ObjD
211
3
0
02 Feb 2024
YOLO-World: Real-Time Open-Vocabulary Object Detection
Tianheng Cheng
Lin Song
Yixiao Ge
Wenyu Liu
Xinggang Wang
Ying Shan
VLM
ObjD
321
583
0
30 Jan 2024
MResT: Multi-Resolution Sensing for Real-Time Control with Vision-Language Models
Conference on Robot Learning (CoRL), 2024
Saumya Saxena
Mohit Sharma
Oliver Kroemer
214
4
0
25 Jan 2024
Generalizing Visual Question Answering from Synthetic to Human-Written Questions via a Chain of QA with a Large Language Model
European Conference on Artificial Intelligence (ECAI), 2024
Taehee Kim
Yeongjae Cho
Heejun Shin
Yohan Jo
Dongmyung Shin
291
6
0
12 Jan 2024
UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding
European Conference on Computer Vision (ECCV), 2024
Bowen Shi
Peisen Zhao
Zichen Wang
Yuhang Zhang
Yaoming Wang
...
Wenrui Dai
Junni Zou
Hongkai Xiong
Qi Tian
Xiaopeng Zhang
VLM
148
12
0
12 Jan 2024
GroundingGPT: Language Enhanced Multi-modal Grounding Model
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Zhaowei Li
Qi Xu
Dong Zhang
Hang Song
Yiqing Cai
...
Junting Pan
Zefeng Li
Van Tu Vu
Zhida Huang
Tao Wang
493
89
0
11 Jan 2024
An Open and Comprehensive Pipeline for Unified Object Grounding and Detection
Xiangyu Zhao
Yicheng Chen
Shilin Xu
Xiangtai Li
Xinjiang Wang
Yining Li
Haian Huang
ObjD
AI4CE
223
52
0
04 Jan 2024
Context-Guided Spatio-Temporal Video Grounding
Computer Vision and Pattern Recognition (CVPR), 2024
Xin Gu
Hengrui Fan
Yan Huang
Tiejian Luo
Libo Zhang
224
37
0
03 Jan 2024
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
Neural Information Processing Systems (NeurIPS), 2024
Ziyi Bai
Ruiping Wang
Xilin Chen
272
12
0
03 Jan 2024
Generating Enhanced Negatives for Training Language-Based Object Detectors
Computer Vision and Pattern Recognition (CVPR), 2023
Shiyu Zhao
Long Zhao
Vijay Kumar B.G
Yumin Suh
Dimitris N. Metaxas
Manmohan Chandraker
S. Schulter
ObjD
VLM
360
12
0
29 Dec 2023