MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

IEEE International Conference on Computer Vision (ICCV), 2021
26 April 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD, VLM
ArXiv (abs) | PDF | HTML | GitHub (1008★)

Papers citing "MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"

50 / 678 papers shown
E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving
Yihong Tang
Haicheng Liao
Tong Nie
Junlin He
Ao Qu
Kehua Chen
Wei Ma
Zhenning Li
Lijun Sun
Chengzhong Xu
127
1
0
04 Dec 2025
Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension
Juexi Shao
Siyou Li
Yujian Gan
Chris Madge
Vanja Karan
Massimo Poesio
140
0
0
02 Dec 2025
Artemis: Structured Visual Reasoning for Perception Policy Learning
Wei Tang
Yanpeng Sun
Shan Zhang
Xiaofan Li
Piotr Koniusz
Wei Li
Na Zhao
Z. Li
LRM, VLM
107
0
0
01 Dec 2025
SceneProp: Combining Neural Network and Markov Random Field for Scene-Graph Grounding
Keita Otani
Tatsuya Harada
76
0
0
30 Nov 2025
Advanced Data Collection Techniques in Cloud Security: A Multi-Modal Deep Learning Autoencoder Approach
Aamiruddin Syed
Mohammed Ilyas Ahmad
54
0
0
26 Nov 2025
Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning
Xin Gu
H. Zhang
Qihang Fan
Jingxuan Niu
Zhipeng Zhang
Libo Zhang
G. Chen
Fan Chen
Longyin Wen
Sijie Zhu
AI4TS, LRM
327
1
0
26 Nov 2025
Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving
J. N. Han
Meng Tian
Jiangtong Zhu
Fan He
Huixin Zhang
...
Siyuan Dong
Lu Hou
Qingqiu Huang
Xiaosong Jia
H. Xu
VLM
157
1
0
24 Nov 2025
QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention
Hyunwoo Oh
Hanning Chen
Sanggeon Yun
Yang Ni
Wenjun Huang
Tamoghno Das
Suyeon Jang
Mohsen Imani
VLM
162
0
0
17 Nov 2025
Backdoor Attacks on Open Vocabulary Object Detectors via Multi-Modal Prompt Tuning
Ankita Raj
Chetan Arora
ObjD, AAML, VLM
282
0
0
16 Nov 2025
LIHE: Linguistic Instance-Split Hyperbolic-Euclidean Framework for Generalized Weakly-Supervised Referring Expression Comprehension
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
X. Shi
Silin Cheng
Sirui Zhao
Yunhan Jiang
Enhong Chen
Yang Liu
Sebastien Ourselin
152
0
0
15 Nov 2025
Semantic-Guided Natural Language and Visual Fusion for Cross-Modal Interaction Based on Tiny Object Detection
Xian-Hong Huang
Hui-Kai Su
Chi-Chia Sun
Jun-Wei Hsieh
ObjD
419
0
0
07 Nov 2025
SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
Ellis L Brown
Arijit Ray
Ranjay Krishna
Ross B. Girshick
Rob Fergus
Saining Xie
355
6
0
06 Nov 2025
GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLMs
Guanghao Zheng
Bowen Shi
Mingxing Xu
Ruoyu Sun
Peisen Zhao
...
Wenrui Dai
Junni Zou
Hongkai Xiong
Xiaopeng Zhang
Qi Tian
VLM
161
0
0
23 Oct 2025
MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos
Gabriel Fiastre
Antoine Yang
Cordelia Schmid
VOS
446
1
0
16 Oct 2025
Spatial Preference Rewarding for MLLMs Spatial Understanding
Han Qiu
Peng Gao
Lewei Lu
Xiaoqin Zhang
Ling Shao
Shijian Lu
LRM
134
0
0
16 Oct 2025
MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning
Mattia Segu
Marta Tintore Gazulla
Yongqin Xian
Luc Van Gool
Federico Tombari
86
0
0
16 Oct 2025
What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging
Inha Kang
Youngsun Lim
S. Lee
Jiho Choi
Junsuk Choe
Hyunjung Shim
99
0
0
15 Oct 2025
Detect Anything via Next Point Prediction
Qing Jiang
Junan Huo
Xingyu Chen
Yuda Xiong
Zhaoyang Zeng
Yihao Chen
Tianhe Ren
Junzhi Yu
Lei Zhang
ObjD
211
11
0
14 Oct 2025
Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey
Jinxuan Li
Chaolei Tan
Haoxuan Chen
Jianxin Ma
Jian-Fang Hu
Wei-Shi Zheng
Jianhuang Lai
VLM
149
1
0
12 Oct 2025
Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding
Weikai Huang
Jieyu Zhang
Taoyang Jia
Chenhao Zheng
Ziqi Gao
J. S. Park
Winson Han
Ranjay Krishna
226
0
0
10 Oct 2025
A Multimodal Depth-Aware Method For Embodied Reference Understanding
Fevziye Irem Eyiokur
Dogucan Yaman
H. K. Ekenel
Alexander Waibel
ObjD
338
0
0
09 Oct 2025
Referring Expression Comprehension for Small Objects
Kanoko Goto
Takumi Hirose
Mahiro Ukai
Shuhei Kurita
Nakamasa Inoue
ObjD
146
1
0
04 Oct 2025
CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning
Qihua Dong
Luis Figueroa
Handong Zhao
Kushal Kafle
Jason Kuen
Zhihong Ding
Scott D. Cohen
Y. Fu
ObjD, LRM
196
0
0
03 Oct 2025
Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs
Yongyi Su
H. Zhang
Shijie Li
Nanqing Liu
Jingyi Liao
...
Chen Li
Nancy F. Chen
Shuicheng Yan
Xulei Yang
Xun Xu
MLLM, VLM
178
3
0
02 Oct 2025
VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs
Peng Liu
H. Shen
Chunxin Fang
Zhicheng Sun
Jiajia Liao
T. Zhao
MLLM, ObjD, VLM, LRM
214
2
0
30 Sep 2025
TimeScope: Towards Task-Oriented Temporal Grounding In Long Videos
Xiangrui Liu
Minghao Qin
Yan Shu
Zhengyang Liang
Yang Tian
Chen Jason Zhang
Bo Zhao
Zheng Liu
319
0
0
30 Sep 2025
NePTune: A Neuro-Pythonic Framework for Tunable Compositional Reasoning on Vision-Language
Danial Kamali
Parisa Kordjamshidi
NAI, LRM, CoGe, VLM
800
3
0
30 Sep 2025
Talk in Pieces, See in Whole: Disentangling and Hierarchical Aggregating Representations for Language-based Object Detection
Sojung An
Kwanyong Park
Yong Jae Lee
Donghyun Kim
158
0
0
29 Sep 2025
VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception
Ziang Yan
Xinhao Li
Yinan He
Zhengrong Yue
Xiangyu Zeng
Yali Wang
Yu Qiao
Limin Wang
Yi Wang
MLLM, VLM, LRM
213
13
0
25 Sep 2025
MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook
Peng Xu
Shengwu Xiong
Jiajun Zhang
Yaxiong Chen
Bowen Zhou
...
Yang Yang
Yanglin Deng
Yashu Kang
Ye Yuan
Y. Wen
LRM
127
1
0
17 Sep 2025
Improving Generalized Visual Grounding with Instance-aware Joint Learning
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025
Ming Dai
Wenxuan Cheng
Jiang-Jiang Liu
Lingfeng Yang
Zhenhua Feng
Wankou Yang
Jingdong Wang
ObjD, ISeg
255
4
0
17 Sep 2025
TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation
Qianqi Lu
Yuxiang Xie
Jing Zhang
Shiwei Zou
Yan Chen
Xidao Luan
146
0
0
16 Sep 2025
Multi-animal tracking in Transition: Comparative Insights into Established and Emerging Methods
Smart Agricultural Technology (SAT), 2025
Anne Marthe Sophie Ngo Bibinbe
Patrick Gagnon
Jamie Ahloy-Dallaire
Eric R. Paquet
VOT
214
0
0
15 Sep 2025
Towards Understanding Visual Grounding in Visual Language Models
Georgios Pantazopoulos
Eda B. Özyiğit
ObjD
315
3
0
12 Sep 2025
WAVE-DETR Multi-Modal Visible and Acoustic Real-Life Drone Detector
Razvan Stefanescu
Ethan Oh
Ruben Vazquez
Chris Mesterharm
Constantin Serban
R. Chadha
220
0
0
11 Sep 2025
Visual Grounding from Event Cameras
Lingdong Kong
Dongyue Lu
Ao Liang
Rong Li
Yuhao Dong
Tianshuai Hu
Lai Xing Ng
Wei Tsang Ooi
Benoit R. Cottereau
VGen
133
1
0
11 Sep 2025
Light-Weight Cross-Modal Enhancement Method with Benchmark Construction for UAV-based Open-Vocabulary Object Detection
Zhenhai Weng
Xinjie Li
Can Wu
Weijie He
Jianfeng Lv
Dong Zhou
Zhongliang Yu
ObjD, VLM
243
0
0
07 Sep 2025
PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination
Ming Dai
Wenxuan Cheng
Jiedong Zhuang
Jiang-Jiang Liu
Hongshen Zhao
Zhenhua Feng
Wankou Yang
ObjD
229
3
0
05 Sep 2025
GENNAV: Polygon Mask Generation for Generalized Referring Navigable Regions
Kei Katsumata
Yui Iioka
Naoki Hosomi
Teruhisa Misu
Kentaro Yamada
K. Sugiura
127
0
0
28 Aug 2025
MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs
Haonan Ge
Yiwei Wang
Ming-Hsuan Yang
Yujun Cai
177
5
0
14 Aug 2025
DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
Wenwen Yu
Zhibo Yang
Yuliang Liu
Xiang Bai
MLLM, OffRL, LRM
92
4
0
12 Aug 2025
Textual Inversion for Efficient Adaptation of Open-Vocabulary Object Detectors Without Forgetting
Frank Ruis
Gertjan J. Burghouts
Hugo Kuijf
ObjD, VLM
102
0
0
07 Aug 2025
Latent Expression Generation for Referring Image Segmentation and Grounding
S. Yu
Joonbeom Hong
Joonseok Lee
Jeany Son
ObjD
201
1
0
07 Aug 2025
Referring Remote Sensing Image Segmentation with Cross-view Semantics Interaction Network
Jiaxing Yang
Lihe Zhang
Huchuan Lu
151
0
0
02 Aug 2025
Multimodal Referring Segmentation: A Survey
Henghui Ding
Song Tang
Shuting He
Chang-rui Liu
Zuxuan Wu
Yu-Gang Jiang
385
11
0
01 Aug 2025
Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation
Sobhan Asasi
Mohamed Ilyas Lakhal
Ozge Mercanoglu Sincan
Richard Bowden
SLR
202
1
0
31 Jul 2025
Modality-Aware Feature Matching: A Comprehensive Review of Single- and Cross-Modality Techniques
Weide Liu
Wei Zhou
Jun Liu
Ping Hu
Jun Cheng
Jungong Han
Weisi Lin
3DV
226
3
0
30 Jul 2025
CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding
Fevziye Irem Eyiokur
Dogucan Yaman
H. K. Ekenel
Alexander Waibel
224
1
0
29 Jul 2025
Interpretable Open-Vocabulary Referring Object Detection with Reverse Contrast Attention
Drandreb Earl O. Juanico
Rowel O. Atienza
Jeffrey Kenneth Go
ObjD
280
0
0
26 Jul 2025
Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras
Lingdong Kong
Dongyue Lu
Ao Liang
Rong Li
Yuhao Dong
Tianshuai Hu
Lai Xing Ng
Wei Tsang Ooi
Benoit R. Cottereau
VGen
312
4
0
23 Jul 2025