MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding


26 April 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
    ObjD
    VLM

Papers citing "MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"

50 / 607 papers shown
Title
InstructDET: Diversifying Referring Object Detection with Generalized Instructions
Ronghao Dang
Jiangyan Feng
Haodong Zhang
Chongjian Ge
Lin Song
...
Chengju Liu
Qi Chen
Feng Zhu
Rui Zhao
Yibing Song
ObjD
13
11
0
08 Oct 2023
Lightweight In-Context Tuning for Multimodal Unified Models
Yixin Chen
Shuai Zhang
Boran Han
Jiaya Jia
13
2
0
08 Oct 2023
Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction
Yiren Jian
Tingkai Liu
Yunzhe Tao
Chunhui Zhang
Soroush Vosoughi
HX Yang
VLM
15
7
0
05 Oct 2023
CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection
Yang Cao
Yihan Zeng
Hang Xu
Dan Xu
3DPC
ObjD
8
33
0
04 Oct 2023
Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving
Mahyar Najibi
Jingwei Ji
Yin Zhou
C. Qi
Xinchen Yan
Scott Ettinger
Drago Anguelov
14
27
0
25 Sep 2023
CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation
Kexin Li
Zongxin Yang
Lei Chen
Yezhou Yang
Jun Xiao
VOS
37
50
0
18 Sep 2023
PRE: Vision-Language Prompt Learning with Reparameterization Encoder
Anh Pham Thi Minh
An Duc Nguyen
Georgios Tzimiropoulos
VPVLM
VLM
17
3
0
14 Sep 2023
Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation
Yunhao Ge
Jiashu Xu
Brian Nlong Zhao
Neel Joshi
Laurent Itti
Vibhav Vineet
DiffM
30
14
0
12 Sep 2023
Multi3DRefer: Grounding Text Description to Multiple 3D Objects
Yiming Zhang
ZeMing Gong
Angel X. Chang
45
63
0
11 Sep 2023
Language Prompt for Autonomous Driving
Dongming Wu
Wencheng Han
Tiancai Wang
Yingfei Liu
Cheng-zhong Xu
Jianbing Shen
VLM
33
73
0
08 Sep 2023
Box-based Refinement for Weakly Supervised and Unsupervised Localization Tasks
Eyal Gomel
Tal Shaharabany
Lior Wolf
ObjD
18
3
0
07 Sep 2023
DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners
Clarence Lee
M Ganesh Kumar
Cheston Tan
28
3
0
07 Sep 2023
A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
Noriyuki Kojima
Hadar Averbuch-Elor
Yoav Artzi
21
2
0
06 Sep 2023
Dense Object Grounding in 3D Scenes
Wencan Huang
Daizong Liu
Wei Hu
13
17
0
05 Sep 2023
CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection
Jiajin Tang
Ge Zheng
Jingyi Yu
Sibei Yang
ObjD
14
22
0
03 Sep 2023
Catalog Phrase Grounding (CPG): Grounding of Product Textual Attributes in Product Images for e-commerce Vision-Language Applications
Wenyi Wu
Karim Bouyarmane
Ismail B. Tutar
23
2
0
30 Aug 2023
GREC: Generalized Referring Expression Comprehension
Shuting He
Henghui Ding
Chang Liu
Xudong Jiang
ObjD
19
14
0
30 Aug 2023
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection
Yifan Xu
Mengdan Zhang
Xiaoshan Yang
Changsheng Xu
ObjD
19
5
0
30 Aug 2023
Shatter and Gather: Learning Referring Image Segmentation with Text Supervision
Dongwon Kim
Nam-Won Kim
Cuiling Lan
Suha Kwak
VLM
26
19
0
29 Aug 2023
UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory
Haiwen Diao
Bo Wan
Y. Zhang
Xuecong Jia
Huchuan Lu
Long Chen
VLM
31
17
0
28 Aug 2023
Towards Unified Token Learning for Vision-Language Tracking
Yaozong Zheng
Bineng Zhong
Qihua Liang
Guorong Li
R. Ji
Xianxian Li
19
28
0
27 Aug 2023
Beyond One-to-One: Rethinking the Referring Image Segmentation
Yutao Hu
Qixiong Wang
Wenqi Shao
Enze Xie
Zhenguo Li
Jungong Han
Ping Luo
3DV
14
37
0
26 Aug 2023
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models
Chi Chen
Ruoyu Qin
Fuwen Luo
Xiaoyue Mi
Peng Li
Maosong Sun
Yang Liu
MLLM
VLM
14
45
0
25 Aug 2023
How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection
Yi Yao
Peng Liu
Tiancheng Zhao
Qianqian Zhang
Jiajia Liao
Chunxin Fang
Kyusong Lee
Qing Wang
VLM
ObjD
17
12
0
25 Aug 2023
SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data
Ziyan Yang
Kushal Kafle
Zhe-nan Lin
Scott D. Cohen
Zhihong Ding
Vicente Ordonez
19
1
0
24 Aug 2023
Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation
Yibo Cui
Liang Xie
Yakun Zhang
Meishan Zhang
Ye Yan
Erwei Yin
LM&Ro
29
16
0
24 Aug 2023
HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks
Zichao Dong
Weikun Zhang
Xufeng Huang
Hang Ji
Xin Zhan
Junbo Chen
VLM
17
4
0
24 Aug 2023
RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D
Shuhei Kurita
Naoki Katsura
Eri Onami
EgoV
18
14
0
23 Aug 2023
Deep Metric Loss for Multimodal Learning
Sehwan Moon
Hyun-Yong Lee
14
0
0
21 Aug 2023
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
Navid Rajabi
Jana Kosecka
VLM
16
11
0
18 Aug 2023
RLIPv2: Fast Scaling of Relational Language-Image Pre-training
Hangjie Yuan
Shiwei Zhang
Xiang Wang
Samuel Albanie
Yining Pan
Tao Feng
Jianwen Jiang
Dong Ni
Yingya Zhang
Deli Zhao
VLM
14
37
0
18 Aug 2023
Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
Guangyi Chen
Xiao Liu
Guangrun Wang
Kun Zhang
Philip H.S. Torr
Xiaoping Zhang
Yansong Tang
19
18
0
16 Aug 2023
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model
Chuhan Zhang
Ankush Gupta
Andrew Zisserman
VLM
21
19
0
15 Aug 2023
Taming Self-Training for Open-Vocabulary Object Detection
Shiyu Zhao
S. Schulter
Long Zhao
Zhixing Zhang
Vijay Kumar B.G
Yumin Suh
Manmohan Chandraker
Dimitris N. Metaxas
VLM
ObjD
30
12
0
11 Aug 2023
Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods
Ya Jing
Xuelin Zhu
Xingbin Liu
Qie Sima
Taozheng Yang
Yunhai Feng
Tao Kong
LM&Ro
25
16
0
07 Aug 2023
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
Weiyun Wang
Min Shi
Qingyun Li
Wen Wang
Zhenhang Huang
...
Zhiguo Cao
Yushi Chen
Tong Lu
Jifeng Dai
Yu Qiao
LRM
MLLM
33
83
0
03 Aug 2023
Grounded Image Text Matching with Mismatched Relation Reasoning
Yu Wu
Yan-Tao Wei
Haozhe Jasper Wang
Yongfei Liu
Sibei Yang
Xuming He
21
6
0
02 Aug 2023
Towards General Visual-Linguistic Face Forgery Detection
Ke Sun
Shen Chen
Taiping Yao
Haozhe Yang
Xiaoshuai Sun
Shouhong Ding
R. Ji
19
12
0
31 Jul 2023
Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks
Kousik Rajesh
Mrigank Raman
M. A. Karim
Pranit Chawla
VLM
23
2
0
31 Jul 2023
JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery
Jiahao Li
Zongxin Yang
Xiaohan Wang
Jianxin Ma
Chang Zhou
Yi Yang
27
11
0
31 Jul 2023
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor
Corentin Dancette
Alexandre Ramé
Matthieu Cord
MoMe
MLLM
30
42
0
30 Jul 2023
Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition
Huy Ha
Peter R. Florence
Shuran Song
LM&Ro
34
148
0
26 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming Yang
F. Khan
VLM
18
117
0
25 Jul 2023
3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding
Zehan Wang
Haifeng Huang
Yang Zhao
Lin Li
Xize Cheng
Yichen Zhu
Aoxiong Yin
Zhou Zhao
3DPC
25
20
0
25 Jul 2023
Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation
Jinxian Liu
Chen Ju
Chaofan Ma
Yanfeng Wang
Yu Wang
Ya-Qin Zhang
VOS
16
23
0
25 Jul 2023
Described Object Detection: Liberating Object Detection with Flexible Expressions
Chi Xie
Zhao Zhang
YiXuan Wu
Feng Zhu
Rui Zhao
Shuang Liang
ObjD
32
30
0
24 Jul 2023
Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision
Menghao Li
Chunlei Wang
W. Feng
Shuchang Lyu
Guangliang Cheng
Xiangtai Li
Binghao Liu
Qi Zhao
28
5
0
23 Jul 2023
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method
Zhihong Chen
Ruifei Zhang
Yibing Song
Xiang Wan
Guanbin Li
20
15
0
21 Jul 2023
Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation
Zunnan Xu
Zhihong Chen
Yong Zhang
Yibing Song
Xiang Wan
Guanbin Li
VLM
17
47
0
21 Jul 2023
Divert More Attention to Vision-Language Object Tracking
Mingzhe Guo
Zhipeng Zhang
Li Jing
Haibin Ling
Heng Fan
VLM
22
3
0
19 Jul 2023