ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2026 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2104.12763
  4. Cited By
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
v1v2 (latest)

MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

IEEE International Conference on Computer Vision (ICCV), 2021
26 April 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
    ObjDVLM
ArXiv (abs)PDFHTMLGithub (1008★)

Papers citing "MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"

50 / 678 papers shown
An Efficient and Effective Transformer Decoder-Based Framework for
  Multi-Task Visual Grounding
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual GroundingEuropean Conference on Computer Vision (ECCV), 2024
Wei Chen
Mahdieh Hatamian
Yu Wu
241
16
0
02 Aug 2024
Look Hear: Gaze Prediction for Speech-directed Human Attention
Look Hear: Gaze Prediction for Speech-directed Human AttentionEuropean Conference on Computer Vision (ECCV), 2024
Sounak Mondal
Seoyoung Ahn
Zhibo Yang
Niranjan Balasubramanian
Dimitris Samaras
G. Zelinsky
Minh Hoai
407
3
0
28 Jul 2024
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
Junyi Li
Junfeng Wu
Weizhi Zhao
Song Bai
Xiang Bai
229
13
0
23 Jul 2024
HAPFI: History-Aware Planning based on Fused Information
HAPFI: History-Aware Planning based on Fused Information
Sujin Jeon
Suyeon Shin
Byoung-Tak Zhang
197
1
0
23 Jul 2024
Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
Ziyuan Huang
Kaixiang Ji
Biao Gong
Zhiwu Qing
Qinglong Zhang
Kecheng Zheng
Jian Wang
Jingdong Chen
Ming Yang
LRM
226
5
0
22 Jul 2024
Weak-to-Strong Compositional Learning from Generative Models for
  Language-based Object Detection
Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection
Kwanyong Park
Kuniaki Saito
Donghyun Kim
VLMCoGe
251
5
0
21 Jul 2024
Learning Visual Grounding from Generative Vision and Language Model
Learning Visual Grounding from Generative Vision and Language Model
Shijie Wang
Dahun Kim
A. Taalimi
Chen Sun
Weicheng Kuo
ObjD
286
18
0
18 Jul 2024
SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language
  Pre-trained Models
SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models
Yang Zhou
Yongjian Wu
Jiya Saiyin
Bingzheng Wei
Maode Lai
Eric Chang
Yan Xu
VLM
225
2
0
16 Jul 2024
OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer
OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer
Yu Wang
Xiangbo Su
Qiang Chen
Xinyu Zhang
Teng Xi
Kun Yao
Errui Ding
Qiang Chen
Jingdong Wang
ObjDVLM
101
2
0
15 Jul 2024
Pathformer3D: A 3D Scanpath Transformer for 360° Images
Pathformer3D: A 3D Scanpath Transformer for 360° Images
Rong Quan
Yantao Lai
Mengyu Qiu
Dong Liang
ViT
197
0
0
15 Jul 2024
Plain-Det: A Plain Multi-Dataset Object Detector
Plain-Det: A Plain Multi-Dataset Object Detector
Cheng Shi
Yuchen Zhu
Sibei Yang
ObjDVLM
238
8
0
14 Jul 2024
Layer-Wise Relevance Propagation with Conservation Property for ResNet
Layer-Wise Relevance Propagation with Conservation Property for ResNet
Seitaro Otsuki
T. Iida
Félix Doublet
Tsubasa Hirakawa
Takayoshi Yamashita
H. Fujiyoshi
Komei Sugiura
FAtt
308
8
0
12 Jul 2024
Textual Query-Driven Mask Transformer for Domain Generalized
  Segmentation
Textual Query-Driven Mask Transformer for Domain Generalized Segmentation
Byeonghyun Pak
Byeongju Woo
Sunghwan Kim
Dae-Hwan Kim
Hoseong Kim
420
20
0
12 Jul 2024
SHERL: Synthesizing High Accuracy and Efficient Memory for
  Resource-Limited Transfer Learning
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning
Haiwen Diao
Bo Wan
Xu Jia
Yunzhi Zhuge
Ying Zhang
Huchuan Lu
Long Chen
VLM
240
11
0
10 Jul 2024
ActionVOS: Actions as Prompts for Video Object Segmentation
ActionVOS: Actions as Prompts for Video Object Segmentation
Liangyang Ouyang
Ruicong Liu
Yifei Huang
Ryosuke Furuta
Yoichi Sato
VOS
212
9
0
10 Jul 2024
Aligning Cyber Space with Physical World: A Comprehensive Survey on
  Embodied AI
Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI
Zehua Wang
Weixing Chen
Yongjie Bai
Xiaodan Liang
Guanbin Li
Wen Gao
Liang Lin
LM&RoSyDaAI4CE
619
190
0
09 Jul 2024
Multi-Object Hallucination in Vision-Language Models
Multi-Object Hallucination in Vision-Language Models
Xuweiyi Chen
Ziqiao Ma
Xuejun Zhang
Sihan Xu
Shengyi Qian
Jianing Yang
David Fouhey
Joyce Chai
304
42
0
08 Jul 2024
Described Spatial-Temporal Video Detection
Described Spatial-Temporal Video Detection
Wei Ji
Xiangyan Liu
Yingfei Sun
Jiajun Deng
You Qin
Ammar Nuwanna
Mengyao Qiu
Lina Wei
Roger Zimmermann
278
3
0
08 Jul 2024
FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot
  Performance
FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance
Jiedong Zhuang
Jiaqi Hu
Lianrui Mu
Rui Hu
Xiaoyu Liang
Jiangnan Ye
Haoji Hu
CLIPVLM
329
7
0
08 Jul 2024
Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection
  Enhanced by Comprehensive Guidance from Text and Image
Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image
Pengkun Jiao
Na Zhao
Yue Yu
Yu-Gang Jiang
VLMObjD
257
12
0
07 Jul 2024
Dude: Dual Distribution-Aware Context Prompt Learning For Large
  Vision-Language Model
Dude: Dual Distribution-Aware Context Prompt Learning For Large Vision-Language Model
D. M. Nguyen
An T. Le
Trung Q. Nguyen
Nghiem Tuong Diep
Tai Nguyen
D. Duong-Tran
Jan Peters
Li Shen
Mathias Niepert
Daniel Sonntag
VLM
201
6
0
05 Jul 2024
VoxAct-B: Voxel-Based Acting and Stabilizing Policy for Bimanual
  Manipulation
VoxAct-B: Voxel-Based Acting and Stabilizing Policy for Bimanual Manipulation
I-Chun Arthur Liu
Sicheng He
Daniel Seita
Gaurav Sukhatme
LM&Ro
270
23
0
04 Jul 2024
SafaRi:Adaptive Sequence Transformer for Weakly Supervised Referring
  Expression Segmentation
SafaRi:Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation
Sayan Nag
Koustava Goswami
Srikrishna Karanam
295
6
0
02 Jul 2024
Camera-LiDAR Cross-modality Gait Recognition
Camera-LiDAR Cross-modality Gait Recognition
Wenxuan Guo
Yingping Liang
Zhiyu Pan
Ziheng Xi
Jianjiang Feng
Jie Zhou
CVBM
296
7
0
02 Jul 2024
The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6
  -- Grounded videoQA
The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA
Hailiang Zhang
Dian Chao
Zhihao Guan
Yang Yang
224
0
0
02 Jul 2024
Object Segmentation from Open-Vocabulary Manipulation Instructions Based
  on Optimal Transport Polygon Matching with Multimodal Foundation Models
Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models
Takayuki Nishimura
Katsuyuki Kuyo
Motonari Kambara
Komei Sugiura
DiffM
261
1
0
01 Jul 2024
Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language
Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language
Yicheng Chen
Xiangtai Li
Yining Li
Yanhong Zeng
Jianzong Wu
Xiangyu Zhao
Kai Chen
VLMDiffM
427
3
0
28 Jun 2024
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Yuxuan Zhang
Tianheng Cheng
Lianghui Zhu
Lei Liu
Heng Liu
Longjin Ran
Xiaoxin Chen
Xiaoxin Chen
Wenyu Liu
Xinggang Wang
VLM
600
57
0
28 Jun 2024
Lifelong Robot Library Learning: Bootstrapping Composable and
  Generalizable Skills for Embodied Control with Language Models
Lifelong Robot Library Learning: Bootstrapping Composable and Generalizable Skills for Embodied Control with Language Models
Georgios Tziafas
Hamidreza Kasaei
KELMLM&Ro
317
14
0
26 Jun 2024
Towards Open-World Grasping with Large Vision-Language Models
Towards Open-World Grasping with Large Vision-Language Models
Georgios Tziafas
Hamidreza Kasaei
LM&RoLRM
339
22
0
26 Jun 2024
ScanFormer: Referring Expression Comprehension by Iteratively Scanning
ScanFormer: Referring Expression Comprehension by Iteratively Scanning
Wei Su
Peihan Miao
Huanzhang Dou
Xi Li
ObjD
278
15
0
26 Jun 2024
Revisiting Referring Expression Comprehension Evaluation in the Era of
  Large Multimodal Models
Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models
Jierun Chen
Fangyun Wei
Jinjing Zhao
Sizhe Song
Bohuai Wu
Zhuoxuan Peng
S.-H. Gary Chan
Hongyang R. Zhang
255
33
0
24 Jun 2024
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
Dantong Niu
Yuvan Sharma
Giscard Biamby
Jerome Quenum
Yutong Bai
Baifeng Shi
Trevor Darrell
Roei Herzig
LM&RoVLM
247
61
0
17 Jun 2024
A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances,
  and Future Directions
A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future DirectionsIEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2024
Daizong Liu
Yang Liu
Wencan Huang
Wei Hu
LM&Ro
360
32
0
09 Jun 2024
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Hao Fang
Jiawei Kong
Wenbo Yu
Bin Chen
Jiawei Li
Hao Wu
Ke Xu
Ke Xu
AAMLVLM
424
28
0
08 Jun 2024
Bootstrapping Referring Multi-Object Tracking
Bootstrapping Referring Multi-Object Tracking
Yani Zhang
Dongming Wu
Wencheng Han
Xingping Dong
378
21
0
07 Jun 2024
Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following
Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following
Qiaomu Miao
Alexandros Graikos
Jingwei Zhang
Sounak Mondal
Minh Hoai
Dimitris Samaras
347
1
0
04 Jun 2024
Multi-layer Learnable Attention Mask for Multimodal Tasks
Multi-layer Learnable Attention Mask for Multimodal Tasks
Wayner Barrios
SouYoung Jin
183
2
0
04 Jun 2024
ELSA: Evaluating Localization of Social Activities in Urban Streets
ELSA: Evaluating Localization of Social Activities in Urban Streets
Maryam Hosseini
Marco Cipriano
Sedigheh Eslami
Daniel Hodczak
Liu Liu
Andres Sevtsuk
Gerard de Melo
193
2
0
03 Jun 2024
SAM as the Guide: Mastering Pseudo-Label Refinement in Semi-Supervised
  Referring Expression Segmentation
SAM as the Guide: Mastering Pseudo-Label Refinement in Semi-Supervised Referring Expression Segmentation
Danni Yang
Jiayi Ji
Yiwei Ma
Tianyu Guo
Haowei Wang
Xiaoshuai Sun
Rongrong Ji
ISegVLM
339
14
0
03 Jun 2024
Collaborative Novel Object Discovery and Box-Guided Cross-Modal Alignment for Open-Vocabulary 3D Object Detection
Collaborative Novel Object Discovery and Box-Guided Cross-Modal Alignment for Open-Vocabulary 3D Object Detection
Yang Cao
Yihan Zeng
Hang Xu
Dan Xu
3DPCObjD
329
15
0
02 Jun 2024
RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection
RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection
Fangyi Chen
Han Zhang
Zhantao Yang
Hao Chen
Kai Hu
Marios Savvides
ObjDVLM
214
7
0
30 May 2024
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention
Weitai Kang
Mengxue Qu
Jyoti Kini
Yunchao Wei
Mubarak Shah
Yan Yan
LM&Ro3DPC
274
16
0
28 May 2024
LLM-Optic: Unveiling the Capabilities of Large Language Models for
  Universal Visual Grounding
LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding
Haoyu Zhao
Wenhang Ge
Ying-Cong Chen
ObjDMLLMVLM
292
7
0
27 May 2024
VICtoR: Learning Hierarchical Vision-Instruction Correlation Rewards for Long-horizon Manipulation
VICtoR: Learning Hierarchical Vision-Instruction Correlation Rewards for Long-horizon Manipulation
Kuo-Han Hung
Pang-Chi Lo
Jia-Fong Yeh
Han-Yuan Hsu
Yi-Ting Chen
Winston H. Hsu
409
2
0
26 May 2024
V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel
  Multimodal LLM
V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM
Abdur Rahman
Rajat Chawla
Muskaan Kumar
Arkajit Datta
Adarsh Jha
NS Mukunda
Ishaan Bhola
327
4
0
24 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
889
166
0
23 May 2024
Open-Vocabulary Spatio-Temporal Action Detection
Open-Vocabulary Spatio-Temporal Action Detection
Tao Wu
Shuqiu Ge
Jie Qin
Gangshan Wu
Limin Wang
ObjD
195
9
0
17 May 2024
Grounded 3D-LLM with Referent Tokens
Grounded 3D-LLM with Referent Tokens
Yilun Chen
Shuai Yang
Haifeng Huang
Tai Wang
Ruiyuan Lyu
Runsen Xu
Dahua Lin
Jiangmiao Pang
336
77
0
16 May 2024
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Tianhe Ren
Qing Jiang
Shilong Liu
Zhaoyang Zeng
Wenlong Liu
...
Hao Zhang
Feng Li
Peijun Tang
Kent Yu
Lei Zhang
ObjDVLM
387
95
0
16 May 2024
Previous
12345...121314
Next