ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2104.12763
  4. Cited By
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

26 April 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
    ObjD
    VLM
ArXivPDFHTML

Papers citing "MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"

50 / 607 papers shown
Title
ClipCrop: Conditioned Cropping Driven by Vision-Language Model
ClipCrop: Conditioned Cropping Driven by Vision-Language Model
Zhihang Zhong
Mingxi Cheng
Zhirong Wu
Yuhui Yuan
Yinqiang Zheng
Ji Li
Han Hu
Stephen Lin
Yoichi Sato
Imari Sato
VLM
CLIP
25
3
0
21 Nov 2022
Unifying Tracking and Image-Video Object Detection
Unifying Tracking and Image-Video Object Detection
Peirong Liu
Rui Wang
Pengchuan Zhang
Omid Poursaeed
Yipin Zhou
Xuefei Cao
Sreya . Dutta Roy
Ashish Shah
Ser-Nam Lim
11
0
0
20 Nov 2022
Leveraging per Image-Token Consistency for Vision-Language Pre-training
Leveraging per Image-Token Consistency for Vision-Language Pre-training
Yunhao Gou
Tom Ko
Hansi Yang
James T. Kwok
Yu Zhang
Mingxuan Wang
VLM
16
9
0
20 Nov 2022
Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
Shizhe Chen
Pierre-Louis Guhur
Makarand Tapaswi
Cordelia Schmid
Ivan Laptev
43
74
0
17 Nov 2022
A Unified Mutual Supervision Framework for Referring Expression
  Segmentation and Generation
A Unified Mutual Supervision Framework for Referring Expression Segmentation and Generation
Shijia Huang
Feng Li
Hao Zhang
Siyi Liu
Lei Zhang
Liwei Wang
22
5
0
15 Nov 2022
YORO -- Lightweight End to End Visual Grounding
YORO -- Lightweight End to End Visual Grounding
Chih-Hui Ho
Srikar Appalaraju
Bhavan A. Jasani
R. Manmatha
Nuno Vasconcelos
ObjD
21
21
0
15 Nov 2022
Grounding Scene Graphs on Natural Images via Visio-Lingual Message
  Passing
Grounding Scene Graphs on Natural Images via Visio-Lingual Message Passing
Aditay Tripathi
Anand Mishra
Anirban Chakraborty
17
2
0
03 Nov 2022
VLT: Vision-Language Transformer and Query Generation for Referring
  Segmentation
VLT: Vision-Language Transformer and Query Generation for Referring Segmentation
Henghui Ding
Chang Liu
Suchen Wang
Xudong Jiang
66
115
0
28 Oct 2022
Towards Unifying Reference Expression Generation and Comprehension
Towards Unifying Reference Expression Generation and Comprehension
Duo Zheng
Tao Kong
Ya Jing
Jiaan Wang
Xiaojie Wang
ObjD
27
6
0
24 Oct 2022
Extending Phrase Grounding with Pronouns in Visual Dialogues
Extending Phrase Grounding with Pronouns in Visual Dialogues
Panzhong Lu
Xin Zhang
Meishan Zhang
Min Zhang
ObjD
22
4
0
23 Oct 2022
Learning Point-Language Hierarchical Alignment for 3D Visual Grounding
Learning Point-Language Hierarchical Alignment for 3D Visual Grounding
Jiaming Chen
Weihua Luo
Ran Song
Xiaolin K. Wei
Lin Ma
Wei Emma Zhang
3DV
38
6
0
22 Oct 2022
TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun
  Distillation
TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation
Pengfei Li
Beiwen Tian
Yongliang Shi
Xiaoxue Chen
Hao Zhao
Guyue Zhou
Ya-Qin Zhang
31
19
0
19 Oct 2022
LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine
  Translation
LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation
Hongcheng Guo
Jiaheng Liu
Haoyang Huang
Jian Yang
Zhoujun Li
Dongdong Zhang
Zheng Cui
Furu Wei
37
22
0
19 Oct 2022
Perceptual Grouping in Contrastive Vision-Language Models
Perceptual Grouping in Contrastive Vision-Language Models
Kanchana Ranasinghe
Brandon McKinzie
S. S. Ravi
Yinfei Yang
Alexander Toshev
Jonathon Shlens
VLM
22
51
0
18 Oct 2022
How to Train Vision Transformer on Small-scale Datasets?
How to Train Vision Transformer on Small-scale Datasets?
Hanan Gani
Muzammal Naseer
Mohammad Yaqub
ViT
12
49
0
13 Oct 2022
Visual Classification via Description from Large Language Models
Visual Classification via Description from Large Language Models
Sachit Menon
Carl Vondrick
VLM
11
287
0
13 Oct 2022
One does not fit all! On the Complementarity of Vision Encoders for
  Vision and Language Tasks
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Gregor Geigle
Chen Cecilia Liu
Jonas Pfeiffer
Iryna Gurevych
VLM
24
1
0
12 Oct 2022
Visual Language Maps for Robot Navigation
Visual Language Maps for Robot Navigation
Chen Huang
Oier Mees
Andy Zeng
Wolfram Burgard
LM&Ro
145
343
0
11 Oct 2022
Understanding Embodied Reference with Touch-Line Transformer
Understanding Embodied Reference with Touch-Line Transformer
Y. Li
Xiaoxue Chen
Hao Zhao
Jiangtao Gong
Guyue Zhou
Federico Rossano
Yixin Zhu
158
15
0
11 Oct 2022
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
Yatai Ji
Junjie Wang
Yuan Gong
Lin Zhang
Yan Zhu
Hongfa Wang
Jiaxing Zhang
Tetsuya Sakai
Yujiu Yang
MLLM
19
29
0
11 Oct 2022
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature
  Alignment
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
Shraman Pramanick
Li Jing
Sayan Nag
Jiachen Zhu
Hardik Shah
Yann LeCun
Ramalingam Chellappa
24
21
0
09 Oct 2022
A New Path: Scaling Vision-and-Language Navigation with Synthetic
  Instructions and Imitation Learning
A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning
Aishwarya Kamath
Peter Anderson
Su Wang
Jing Yu Koh
Alexander Ku
Austin Waters
Yinfei Yang
Jason Baldridge
Zarana Parekh
LM&Ro
15
45
0
06 Oct 2022
Video Referring Expression Comprehension via Transformer with
  Content-aware Query
Video Referring Expression Comprehension via Transformer with Content-aware Query
Ji Jiang
Meng Cao
Tengtao Song
Yuexian Zou
19
5
0
06 Oct 2022
Vision+X: A Survey on Multimodal Learning in the Light of Data
Vision+X: A Survey on Multimodal Learning in the Light of Data
Ye Zhu
Yuehua Wu
N. Sebe
Yan Yan
33
16
0
05 Oct 2022
PLOT: Prompt Learning with Optimal Transport for Vision-Language Models
PLOT: Prompt Learning with Optimal Transport for Vision-Language Models
Guangyi Chen
Weiran Yao
Xiangchen Song
Xinyue Li
Yongming Rao
Kun Zhang
VPVLM
VLM
6
62
0
03 Oct 2022
Introducing Vision Transformer for Alzheimer's Disease classification
  task with 3D input
Introducing Vision Transformer for Alzheimer's Disease classification task with 3D input
Zilun Zhang
Farzad Khalvati
MedIm
ViT
14
9
0
03 Oct 2022
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual
  Grounding
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
Yanmin Wu
Xinhua Cheng
Renrui Zhang
Zesen Cheng
Jian Zhang
51
62
0
29 Sep 2022
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual
  Grounding
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding
Fengyuan Shi
Ruopeng Gao
Weilin Huang
Limin Wang
17
23
0
28 Sep 2022
Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual
  Tasks
Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual Tasks
Zhiyang Chen
Yousong Zhu
Zhaowen Li
Fan Yang
Wei Li
...
Chaoyang Zhao
Liwei Wu
Rui Zhao
Jinqiao Wang
Ming Tang
VLM
VOS
59
15
0
28 Sep 2022
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video
  Grounding
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding
Yang Jin
Yongzhi Li
Zehuan Yuan
Yadong Mu
29
32
0
27 Sep 2022
Exploring Modulated Detection Transformer as a Tool for Action
  Recognition in Videos
Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos
Tomás Crisol
J. Ermantraut
Adrián Rostagno
Santiago L. Aggio
Javier Iparraguirre
16
0
0
21 Sep 2022
Towards Robust Referring Image Segmentation
Towards Robust Referring Image Segmentation
Jianzong Wu
Xiangtai Li
Xia Li
Henghui Ding
Yu Tong
Dacheng Tao
3DV
29
40
0
20 Sep 2022
DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for
  Open-world Detection
DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection
Lewei Yao
Jianhua Han
Youpeng Wen
Xiaodan Liang
Dan Xu
Wei Zhang
Zhenguo Li
Chunjing Xu
Hang Xu
CLIP
VLM
115
152
0
20 Sep 2022
Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language
  Models
Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models
Manli Shu
Weili Nie
De-An Huang
Zhiding Yu
Tom Goldstein
Anima Anandkumar
Chaowei Xiao
VLM
VPVLM
178
280
0
15 Sep 2022
Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
Mohit Shridhar
Lucas Manuelli
D. Fox
LM&Ro
155
456
0
12 Sep 2022
OmDet: Large-scale vision-language multi-dataset pre-training with
  multimodal detection network
OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network
Tiancheng Zhao
Peng Liu
Kyusong Lee
VLM
MLLM
ObjD
19
7
0
10 Sep 2022
RLIP: Relational Language-Image Pre-training for Human-Object
  Interaction Detection
RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection
Hangjie Yuan
Jianwen Jiang
Samuel Albanie
Tao Feng
Ziyuan Huang
Dong Ni
Mingqian Tang
VLM
26
51
0
05 Sep 2022
Towards Efficient Use of Multi-Scale Features in Transformer-Based
  Object Detectors
Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors
Gongjie Zhang
Zhipeng Luo
Zichen Tian
Yingchen Yu
Jingyi Zhang
Shijian Lu
30
26
0
24 Aug 2022
Open-Vocabulary Universal Image Segmentation with MaskCLIP
Open-Vocabulary Universal Image Segmentation with MaskCLIP
Zheng Ding
Jieke Wang
Z. Tu
CLIP
ISeg
VLM
41
85
0
18 Aug 2022
What Artificial Neural Networks Can Tell Us About Human Language
  Acquisition
What Artificial Neural Networks Can Tell Us About Human Language Acquisition
Alex Warstadt
Samuel R. Bowman
11
111
0
17 Aug 2022
PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative
  Grounding
PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding
Zihan Ding
Zixiang Ding
Tianrui Hui
Junshi Huang
Xiaoming Wei
Xiaolin K. Wei
Si Liu
12
12
0
11 Aug 2022
Masked Vision and Language Modeling for Multi-modal Representation
  Learning
Masked Vision and Language Modeling for Multi-modal Representation Learning
Gukyeong Kwon
Zhaowei Cai
Avinash Ravichandran
Erhan Bas
Rahul Bhotika
Stefano Soatto
22
67
0
03 Aug 2022
One for All: One-stage Referring Expression Comprehension with Dynamic
  Reasoning
One for All: One-stage Referring Expression Comprehension with Dynamic Reasoning
Zhipeng Zhang
Zhimin Wei
Zhongzhen Huang
Rui Niu
Peng Wang
ObjD
LRM
9
9
0
31 Jul 2022
Fine-grained Retrieval Prompt Tuning
Fine-grained Retrieval Prompt Tuning
Shijie Wang
Jianlong Chang
Zhihui Wang
Haojie Li
Wanli Ouyang
Qi Tian
VLM
VPVLM
11
15
0
29 Jul 2022
Pro-tuning: Unified Prompt Tuning for Vision Tasks
Pro-tuning: Unified Prompt Tuning for Vision Tasks
Xing Nie
Bolin Ni
Jianlong Chang
Gaomeng Meng
Chunlei Huo
Zhaoxiang Zhang
Shiming Xiang
Qi Tian
Chunhong Pan
AAML
VPVLM
VLM
19
69
0
28 Jul 2022
SiRi: A Simple Selective Retraining Mechanism for Transformer-based
  Visual Grounding
SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding
Mengxue Qu
Yu Wu
Wu Liu
Qiqi Gong
Xiaodan Liang
Olga Russakovsky
Yao Zhao
Yunchao Wei
ObjD
11
22
0
27 Jul 2022
DETRs with Hybrid Matching
DETRs with Hybrid Matching
Ding Jia
Yuhui Yuan
Hao He
Xiao-pei Wu
Haojun Yu
Weihong Lin
Lei-huan Sun
Chao Zhang
Hanhua Hu
21
182
0
26 Jul 2022
Multi-Attention Network for Compressed Video Referring Object
  Segmentation
Multi-Attention Network for Compressed Video Referring Object Segmentation
Weidong Chen
Dexiang Hong
Yuankai Qi
Zhenjun Han
Shuhui Wang
Laiyun Qing
Qingming Huang
Guorong Li
VOS
18
35
0
26 Jul 2022
Correspondence Matters for Video Referring Expression Comprehension
Correspondence Matters for Video Referring Expression Comprehension
Meng Cao
Ji Jiang
Long Chen
Yuexian Zou
VOS
17
20
0
21 Jul 2022
Exploiting Unlabeled Data with Vision and Language Models for Object
  Detection
Exploiting Unlabeled Data with Vision and Language Models for Object Detection
Shiyu Zhao
Zhixing Zhang
S. Schulter
Long Zhao
Vijay Kumar B.G
Anastasis Stathopoulos
Manmohan Chandraker
Dimitris N. Metaxas
VLM
ObjD
19
100
0
18 Jul 2022
Previous
123...101112139
Next