MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

IEEE International Conference on Computer Vision (ICCV), 2021

26 April 2021

Papers citing "MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"

50 / 678 papers shown

E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

127

04 Dec 2025

Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension

140

02 Dec 2025

Artemis: Structured Visual Reasoning for Perception Policy Learning

107

01 Dec 2025

SceneProp: Combining Neural Network and Markov Random Field for Scene-Graph Grounding

Keita Otani

Tatsuya Harada

30 Nov 2025

Advanced Data Collection Techniques in Cloud Security: A Multi-Modal Deep Learning Autoencoder Approach

Aamiruddin Syed

Mohammed Ilyas Ahmad

26 Nov 2025

Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning

327

26 Nov 2025

Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving

...

157

24 Nov 2025

QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention

162

17 Nov 2025

Backdoor Attacks on Open Vocabulary Object Detectors via Multi-Modal Prompt Tuning

Ankita Raj

Chetan Arora

ObjD AAML VLM

282

16 Nov 2025

LIHE: Linguistic Instance-Split Hyperbolic-Euclidean Framework for Generalized Weakly-Supervised Referring Expression Comprehension

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

152

15 Nov 2025

Semantic-Guided Natural Language and Visual Fusion for Cross-Modal Interaction Based on Tiny Object Detection

419

07 Nov 2025

SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding

355

06 Nov 2025

GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLMs

...

161

23 Oct 2025

MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos

446

16 Oct 2025

Spatial Preference Rewarding for MLLMs Spatial Understanding

134

16 Oct 2025

MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning

Mattia Segu

Marta Tintore Gazulla

Yongqin Xian

Luc Van Gool

Federico Tombari

16 Oct 2025

What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging

15 Oct 2025

Detect Anything via Next Point Prediction

211

14 Oct 2025

Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey

149

12 Oct 2025

Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding

226

10 Oct 2025

A Multimodal Depth-Aware Method For Embodied Reference Understanding

338

09 Oct 2025

Referring Expression Comprehension for Small Objects

146

04 Oct 2025

CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning

196

03 Oct 2025

Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs

...

178

02 Oct 2025

VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs

214

30 Sep 2025

TimeScope: Towards Task-Oriented Temporal Grounding In Long Videos

319

30 Sep 2025

NePTune: A Neuro-Pythonic Framework for Tunable Compositional Reasoning on Vision-Language

Danial Kamali

Parisa Kordjamshidi

NAI LRM CoGe VLM

800

30 Sep 2025

Talk in Pieces, See in Whole: Disentangling and Hierarchical Aggregating Representations for Language-based Object Detection

158

29 Sep 2025

VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception

213

25 Sep 2025

MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook

...

127

17 Sep 2025

Improving Generalized Visual Grounding with Instance-aware Joint Learning

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025

255

17 Sep 2025

TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation

146

16 Sep 2025

Multi-animal tracking in Transition: Comparative Insights into Established and Emerging Methods

Smart Agricultural Technology (SAT), 2025

Anne Marthe Sophie Ngo Bibinbe

214

15 Sep 2025

Towards Understanding Visual Grounding in Visual Language Models

Georgios Pantazopoulos

Eda B. Özyiğit

ObjD

315

12 Sep 2025

WAVE-DETR Multi-Modal Visible and Acoustic Real-Life Drone Detector

220

11 Sep 2025

Visual Grounding from Event Cameras

133

11 Sep 2025

Light-Weight Cross-Modal Enhancement Method with Benchmark Construction for UAV-based Open-Vocabulary Object Detection

243

07 Sep 2025

PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination

229

05 Sep 2025

GENNAV: Polygon Mask Generation for Generalized Referring Navigable Regions

127

28 Aug 2025

MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs

177

14 Aug 2025

DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding

12 Aug 2025

Textual Inversion for Efficient Adaptation of Open-Vocabulary Object Detectors Without Forgetting

102

07 Aug 2025

Latent Expression Generation for Referring Image Segmentation and Grounding

201

07 Aug 2025

Referring Remote Sensing Image Segmentation with Cross-view Semantics Interaction Network

Jiaxing Yang

Lihe Zhang

Huchuan Lu

151

02 Aug 2025

Multimodal Referring Segmentation: A Survey

385

01 Aug 2025

Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation

Sobhan Asasi

Mohamed Ilyas Lakhal

Ozge Mercanoglu Sincan

Richard Bowden

SLR

202

31 Jul 2025

Modality-Aware Feature Matching: A Comprehensive Review of Single- and Cross-Modality Techniques

226

30 Jul 2025

CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding

224

29 Jul 2025

Interpretable Open-Vocabulary Referring Object Detection with Reverse Contrast Attention

Drandreb Earl O. Juanico

Rowel O. Atienza

Jeffrey Kenneth Go

ObjD

280

26 Jul 2025

Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras

312

23 Jul 2025