arXiv: 2104.12763
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

26 April 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
    ObjD
    VLM

Papers citing "MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"

50 / 607 papers shown
Title
Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image
Pengkun Jiao
Na Zhao
Jingjing Chen
Yu-Gang Jiang
VLM
ObjD
21
3
0
07 Jul 2024
Dude: Dual Distribution-Aware Context Prompt Learning For Large Vision-Language Model
D. M. Nguyen
An T. Le
Trung Q. Nguyen
N. T. Diep
Tai Nguyen
D. Duong-Tran
Jan Peters
Li Shen
Mathias Niepert
Daniel Sonntag
VLM
37
3
0
05 Jul 2024
VoxAct-B: Voxel-Based Acting and Stabilizing Policy for Bimanual Manipulation
I-Chun Arthur Liu
Sicheng He
Daniel Seita
Gaurav Sukhatme
LM&Ro
33
11
0
04 Jul 2024
SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation
Sayan Nag
Koustava Goswami
Srikrishna Karanam
42
2
0
02 Jul 2024
Camera-LiDAR Cross-modality Gait Recognition
Wenxuan Guo
Yingping Liang
Zhiyu Pan
Ziheng Xi
Jianjiang Feng
Jie Zhou
CVBM
25
3
0
02 Jul 2024
The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA
Hailiang Zhang
Dian Chao
Zhihao Guan
Yang Yang
22
0
0
02 Jul 2024
Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models
Takayuki Nishimura
Katsuyuki Kuyo
Motonari Kambara
Komei Sugiura
DiffM
24
0
0
01 Jul 2024
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Yuxuan Zhang
Tianheng Cheng
Lianghui Zhu
Lei Liu
Heng Liu
Longjin Ran
Xiaoxin Chen
Wenyu Liu
Xinggang Wang
VLM
51
24
0
28 Jun 2024
Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language
Yicheng Chen
Xiangtai Li
Yining Li
Yanhong Zeng
Jianzong Wu
Xiangyu Zhao
Kai Chen
VLM
DiffM
56
3
0
28 Jun 2024
Lifelong Robot Library Learning: Bootstrapping Composable and Generalizable Skills for Embodied Control with Language Models
Georgios Tziafas
H. Kasaei
KELM
LM&Ro
23
8
0
26 Jun 2024
Towards Open-World Grasping with Large Vision-Language Models
Georgios Tziafas
H. Kasaei
LM&Ro
LRM
27
11
0
26 Jun 2024
ScanFormer: Referring Expression Comprehension by Iteratively Scanning
Wei Su
Peihan Miao
Huanzhang Dou
Xi Li
ObjD
26
7
0
26 Jun 2024
Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models
Jierun Chen
Fangyun Wei
Jinjing Zhao
Sizhe Song
Bohuai Wu
Zhuoxuan Peng
S.-H. Gary Chan
Hongyang R. Zhang
33
8
0
24 Jun 2024
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
Dantong Niu
Yuvan Sharma
Giscard Biamby
Jerome Quenum
Yutong Bai
Baifeng Shi
Trevor Darrell
Roei Herzig
LM&Ro
VLM
45
23
0
17 Jun 2024
A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions
Daizong Liu
Yang Liu
Wencan Huang
Wei Hu
LM&Ro
29
9
0
09 Jun 2024
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Hao Fang
Jiawei Kong
Wenbo Yu
Bin Chen
Jiawei Li
Hao Wu
Ke Xu
AAML
VLM
30
13
0
08 Jun 2024
Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following
Qiaomu Miao
Alexandros Graikos
Jingwei Zhang
Sounak Mondal
Minh Hoai
Dimitris Samaras
30
0
0
04 Jun 2024
Multi-layer Learnable Attention Mask for Multimodal Tasks
Wayner Barrios
SouYoung Jin
34
0
0
04 Jun 2024
ELSA: Evaluating Localization of Social Activities in Urban Streets
Maryam Hosseini
Marco Cipriano
Sedigheh Eslami
Daniel Hodczak
Liu Liu
Andres Sevtsuk
Gerard de Melo
26
0
0
03 Jun 2024
SAM as the Guide: Mastering Pseudo-Label Refinement in Semi-Supervised Referring Expression Segmentation
Danni Yang
Jiayi Ji
Yiwei Ma
Tianyu Guo
Haowei Wang
Xiaoshuai Sun
Rongrong Ji
ISeg
VLM
32
5
0
03 Jun 2024
Collaborative Novel Object Discovery and Box-Guided Cross-Modal Alignment for Open-Vocabulary 3D Object Detection
Yang Cao
Yihan Zeng
Hang Xu
Dan Xu
3DPC
ObjD
28
6
0
02 Jun 2024
RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection
Fangyi Chen
Han Zhang
Zhantao Yang
Hao Chen
Kai Hu
Marios Savvides
ObjD
VLM
31
5
0
30 May 2024
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention
Weitai Kang
Mengxue Qu
Jyoti Kini
Yunchao Wei
Mubarak Shah
Yan Yan
LM&Ro
3DPC
45
10
0
28 May 2024
LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding
Haoyu Zhao
Wenhang Ge
Ying-cong Chen
ObjD
MLLM
VLM
27
4
0
27 May 2024
VICtoR: Learning Hierarchical Vision-Instruction Correlation Rewards for Long-horizon Manipulation
Kuo-Han Hung
Pang-Chi Lo
Jia-Fong Yeh
Han-Yuan Hsu
Yi-Ting Chen
Winston H. Hsu
28
0
0
26 May 2024
V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM
Abdur Rahman
Rajat Chawla
Muskaan Kumar
Arkajit Datta
Adarsh Jha
NS Mukunda
Ishaan Bhola
40
2
0
24 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
67
41
0
23 May 2024
Open-Vocabulary Spatio-Temporal Action Detection
Tao Wu
Shuqiu Ge
Jie Qin
Gangshan Wu
Limin Wang
ObjD
23
5
0
17 May 2024
Grounded 3D-LLM with Referent Tokens
Yilun Chen
Shuai Yang
Haifeng Huang
Tai Wang
Ruiyuan Lyu
Runsen Xu
Dahua Lin
Jiangmiao Pang
45
22
0
16 May 2024
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Tianhe Ren
Qing Jiang
Shilong Liu
Zhaoyang Zeng
Wenlong Liu
...
Hao Zhang
Feng Li
Peijun Tang
Kent Yu
Lei Zhang
ObjD
VLM
29
33
0
16 May 2024
Spatial Semantic Recurrent Mining for Referring Image Segmentation
Jiaxing Yang
Lihe Zhang
Jiayu Sun
Huchuan Lu
21
0
0
15 May 2024
Language-Image Models with 3D Understanding
Jang Hyun Cho
B. Ivanovic
Yulong Cao
Edward Schmerling
Yue Wang
...
Boyi Li
Yurong You
Philipp Krahenbuhl
Yan Wang
Marco Pavone
LRM
40
16
0
06 May 2024
ScrewMimic: Bimanual Imitation from Human Videos with Screw Space Projection
Arpit Bahety
Priyanka Mandikal
Ben Abbatematteo
Roberto Martín-Martín
25
13
0
06 May 2024
Transcrib3D: 3D Referring Expression Resolution through Large Language Models
Jiading Fang
Xiangshan Tan
Shengjie Lin
Igor Vasiljevic
Vitor Campagnolo Guizilini
Hongyuan Mei
Rares Ambrus
Gregory Shakhnarovich
Matthew R. Walter
LM&Ro
33
4
0
30 Apr 2024
Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM
Navid Rajabi
Jana Kosecka
28
1
0
29 Apr 2024
Closed Loop Interactive Embodied Reasoning for Robot Manipulation
Michal Nazarczuk
Jan Kristof Behrens
Karla Stepanova
Matej Hoffmann
K. Mikolajczyk
LM&Ro
LRM
36
1
0
23 Apr 2024
Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers
Georgios Pantazopoulos
Alessandro Suglia
Oliver Lemon
Arash Eshghi
VLM
21
4
0
21 Apr 2024
HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding
Linhui Xiao
Xiaoshan Yang
Fang Peng
Yaowei Wang
Changsheng Xu
ObjD
24
8
0
20 Apr 2024
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
Chuofan Ma
Yi-Xin Jiang
Jiannan Wu
Zehuan Yuan
Xiaojuan Qi
VLM
ObjD
37
51
0
19 Apr 2024
Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models
Konstantinos Vilouras
Pedro Sanchez
Alison Q. O'Neil
Sotirios A. Tsaftaris
MedIm
37
2
0
19 Apr 2024
MLS-Track: Multilevel Semantic Interaction in RMOT
Zeliang Ma
Yang Song
Zhe Cui
Zhicheng Zhao
Fei Su
Delong Liu
Jingyu Wang
34
4
0
18 Apr 2024
HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision
Siddhant Bansal
Michael Wray
Dima Damen
31
3
0
15 Apr 2024
DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
Lewei Yao
Renjie Pi
Jianhua Han
Xiaodan Liang
Hang Xu
Wei Zhang
Zhenguo Li
Dan Xu
VLM
ObjD
37
19
0
14 Apr 2024
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts
Övgü Özdemir
Erdem Akagündüz
36
10
0
12 Apr 2024
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Haotian Zhang
Haoxuan You
Philipp Dufter
Bowen Zhang
Chen Chen
...
Tsu-jui Fu
William Yang Wang
Shih-Fu Chang
Zhe Gan
Yinfei Yang
ObjD
MLLM
99
44
0
11 Apr 2024
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
Kanchana Ranasinghe
Satya Narayan Shukla
Omid Poursaeed
Michael S. Ryoo
Tsung-Yu Lin
LRM
38
22
0
11 Apr 2024
Hyperbolic Learning with Synthetic Captions for Open-World Detection
Fanjie Kong
Yanbei Chen
Jiarui Cai
Davide Modolo
VLM
ObjD
31
7
0
07 Apr 2024
3DStyleGLIP: Part-Tailored Text-Guided 3D Neural Stylization
Seung-bum Chung
Joohyun Park
Hyewon Kan
Hyeongyeop Kang
CLIP
23
1
0
03 Apr 2024
Text-driven Affordance Learning from Egocentric Vision
Tomoya Yoshida
Shuhei Kurita
Taichi Nishimura
Shinsuke Mori
35
5
0
03 Apr 2024
LocCa: Visual Pretraining with Location-aware Captioners
Bo Wan
Michael Tschannen
Yongqin Xian
Filip Pavetić
Ibrahim M. Alabdulmohsin
Xiao Wang
André Susano Pinto
Andreas Steiner
Lucas Beyer
Xiao-Qi Zhai
VLM
40
6
0
28 Mar 2024