MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
IEEE International Conference on Computer Vision (ICCV), 2021
26 April 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
    ObjD, VLM

Papers citing "MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"

50 / 671 papers shown
Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
Conference on Robot Learning (CoRL), 2022
Mohit Shridhar
Lucas Manuelli
Dieter Fox
LM&Ro
511
651
0
12 Sep 2022
OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network
IET Computer Vision (ICV), 2022
Tiancheng Zhao
Peng Liu
Kyusong Lee
VLM, MLLM, ObjD
114
13
0
10 Sep 2022
RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection
Neural Information Processing Systems (NeurIPS), 2022
Hangjie Yuan
Jianwen Jiang
Samuel Albanie
Tao Feng
Ziyuan Huang
Dong Ni
Mingqian Tang
VLM
275
73
0
05 Sep 2022
Injecting Image Details into CLIP's Feature Space
Zilun Zhang
Cuifeng Shen
Yuan-Chung Shen
Huixin Xiong
Xinyu Zhou
VLM, CLIP
167
0
0
31 Aug 2022
Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors
Computer Vision and Pattern Recognition (CVPR), 2022
Gongjie Zhang
Zhipeng Luo
Zichen Tian
Yingchen Yu
Jingyi Zhang
Shijian Lu
228
40
0
24 Aug 2022
Open-Vocabulary Universal Image Segmentation with MaskCLIP
International Conference on Machine Learning (ICML), 2022
Zheng Ding
Jieke Wang
Zhuowen Tu
CLIP, ISeg, VLM
228
124
0
18 Aug 2022
What Artificial Neural Networks Can Tell Us About Human Language Acquisition
Alex Warstadt
Samuel R. Bowman
209
133
0
17 Aug 2022
PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding
ACM Multimedia (ACM MM), 2022
Zihan Ding
Zixiang Ding
Tianrui Hui
Junshi Huang
Xiaoming Wei
Xiaolin K. Wei
Si Liu
162
15
0
11 Aug 2022
Masked Vision and Language Modeling for Multi-modal Representation Learning
International Conference on Learning Representations (ICLR), 2022
Gukyeong Kwon
Zhaowei Cai
Avinash Ravichandran
Erhan Bas
Rahul Bhotika
Stefano Soatto
187
82
0
03 Aug 2022
One for All: One-stage Referring Expression Comprehension with Dynamic Reasoning
Neurocomputing (Neurocomputing), 2022
Zhipeng Zhang
Zhimin Wei
Zhongzhen Huang
Rui Niu
Peng Wang
ObjD, LRM
212
9
0
31 Jul 2022
Fine-grained Retrieval Prompt Tuning
AAAI Conference on Artificial Intelligence (AAAI), 2022
Shijie Wang
Jianlong Chang
Zhihui Wang
Haojie Li
Wanli Ouyang
Qi Tian
VLM, VPVLM
134
23
0
29 Jul 2022
Pro-tuning: Unified Prompt Tuning for Vision Tasks
Xing Nie
Bolin Ni
Jianlong Chang
Gaomeng Meng
Chunlei Huo
Zhaoxiang Zhang
Shiming Xiang
Qi Tian
Chunhong Pan
AAML, VPVLM, VLM
330
95
0
28 Jul 2022
SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding
European Conference on Computer Vision (ECCV), 2022
Mengxue Qu
Yu Wu
Wu Liu
Qiqi Gong
Xiaodan Liang
Olga Russakovsky
Yao Zhao
Yunchao Wei
ObjD
105
26
0
27 Jul 2022
DETRs with Hybrid Matching
Computer Vision and Pattern Recognition (CVPR), 2022
Ding Jia
Yuhui Yuan
Hao He
Xiao-pei Wu
Haojun Yu
Weihong Lin
Lei-huan Sun
Chao Zhang
Hanhua Hu
328
247
0
26 Jul 2022
Multi-Attention Network for Compressed Video Referring Object Segmentation
ACM Multimedia (ACM MM), 2022
Weidong Chen
Dexiang Hong
Yuankai Qi
Zhenjun Han
Shuhui Wang
Laiyun Qing
Qingming Huang
Guorong Li
VOS
120
51
0
26 Jul 2022
Correspondence Matters for Video Referring Expression Comprehension
ACM Multimedia (ACM MM), 2022
Meng Cao
Ji Jiang
Long Chen
Yuexian Zou
VOS
249
21
0
21 Jul 2022
Exploiting Unlabeled Data with Vision and Language Models for Object Detection
European Conference on Computer Vision (ECCV), 2022
Shiyu Zhao
Zhixing Zhang
S. Schulter
Long Zhao
Vijay Kumar B.G
Anastasis Stathopoulos
Manmohan Chandraker
Dimitris N. Metaxas
VLM, ObjD
155
121
0
18 Jul 2022
3D Concept Grounding on Neural Fields
Neural Information Processing Systems (NeurIPS), 2022
Yining Hong
Yilun Du
Chun-Tse Lin
J. Tenenbaum
Chuang Gan
179
23
0
13 Jul 2022
Inner Monologue: Embodied Reasoning through Planning with Language Models
Conference on Robot Learning (CoRL), 2022
Wenlong Huang
F. Xia
Ted Xiao
Harris Chan
Jacky Liang
...
Tomas Jackson
Linda Luu
Sergey Levine
Karol Hausman
Brian Ichter
LLMAG, LM&Ro, LRM
352
1,135
0
12 Jul 2022
Video Graph Transformer for Video Question Answering
European Conference on Computer Vision (ECCV), 2022
Junbin Xiao
Pan Zhou
Tat-Seng Chua
Shuicheng Yan
ViT
451
94
0
12 Jul 2022
Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection
Neural Information Processing Systems (NeurIPS), 2022
H. Rasheed
Muhammad Maaz
Muhammad Uzair Khattak
Salman Khan
Fahad Shahbaz Khan
ObjD, VLM
297
181
0
07 Jul 2022
STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding
Zihang Lin
Chaolei Tan
Jianfang Hu
Zhi Jin
Tiancai Ye
Weihao Zheng
181
4
0
06 Jul 2022
Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases
Zhihao Yuan
Xu Yan
Zhuo Li
Xuhao Li
Yao Guo
Shuguang Cui
Zhen Li
144
18
0
05 Jul 2022
VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations
Tiancheng Zhao
Tianqi Zhang
Mingwei Zhu
Haozhan Shen
Kyusong Lee
Xiaopeng Lu
Jianwei Yin
VLM, CoGe, MLLM
258
110
0
01 Jul 2022
EBMs vs. CL: Exploring Self-Supervised Visual Pretraining for Visual Question Answering
Violetta Shevchenko
Ehsan Abbasnejad
A. Dick
Anton Van Den Hengel
Damien Teney
187
0
0
29 Jun 2022
DALL-E for Detection: Language-driven Compositional Image Synthesis for Object Detection
Yunhao Ge
Lyne Tchapmi
Brian Nlong Zhao
Neel Joshi
Laurent Itti
Vibhav Vineet
DiffM, ObjD
290
23
0
20 Jun 2022
What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs
Neural Information Processing Systems (NeurIPS), 2022
Tal Shaharabany
Yoad Tewel
Lior Wolf
ObjD
214
22
0
19 Jun 2022
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
International Conference on Learning Representations (ICLR), 2022
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjD, VLM, MLLM
345
470
0
17 Jun 2022
BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning
AAAI Conference on Artificial Intelligence (AAAI), 2022
Xiao Xu
Chenfei Wu
Shachar Rosenman
Vasudev Lal
Wanxiang Che
Nan Duan
160
90
0
17 Jun 2022
MixGen: A New Multi-Modal Data Augmentation
Xiaoshuai Hao
Yi Zhu
Srikar Appalaraju
Aston Zhang
Wanqian Zhang
Boyang Li
Mu Li
VLM
289
120
0
16 Jun 2022
SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos
Neural Information Processing Systems (NeurIPS), 2022
Gamaleldin F. Elsayed
Aravindh Mahendran
Sjoerd van Steenkiste
Klaus Greff
Michael C. Mozer
Thomas Kipf
VOS, OCL
317
167
0
15 Jun 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Neural Information Processing Systems (NeurIPS), 2022
Zi-Yi Dou
Aishwarya Kamath
Zhe Gan
Pengchuan Zhang
Jianfeng Wang
...
Ce Liu
Yann LeCun
Nanyun Peng
Jianfeng Gao
Lijuan Wang
VLM, ObjD
218
149
0
15 Jun 2022
ReCo: Retrieve and Co-segment for Zero-shot Transfer
Neural Information Processing Systems (NeurIPS), 2022
Gyungin Shin
Weidi Xie
Samuel Albanie
VLM
315
119
0
14 Jun 2022
TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Jiajun Deng
Zhengyuan Yang
Daqing Liu
Tianlang Chen
Wen-gang Zhou
Yanyong Zhang
Houqiang Li
Wanli Ouyang
ViT
200
86
0
14 Jun 2022
INDIGO: Intrinsic Multimodality for Domain Generalization
Puneet Mangla
Shivam Chandhok
Milan Aggarwal
V. Balasubramanian
Balaji Krishnamurthy
VLM
148
3
0
13 Jun 2022
GLIPv2: Unifying Localization and Vision-Language Understanding
Haotian Zhang
Pengchuan Zhang
Xiaowei Hu
Yen-Chun Chen
Liunian Harold Li
Xiyang Dai
Lijuan Wang
Lu Yuan
Lei Li
Jianfeng Gao
ObjD, VLM
241
352
0
12 Jun 2022
Referring Image Matting
Computer Vision and Pattern Recognition (CVPR), 2022
Jizhizi Li
Jing Zhang
Dacheng Tao
ObjD, VLM
180
31
0
10 Jun 2022
VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
Wangchunshu Zhou
Yan Zeng
Shizhe Diao
Xinsong Zhang
CoGe, VLM
260
14
0
30 May 2022
Visual Superordinate Abstraction for Robust Concept Learning
Machine Intelligence Research (MIR), 2022
Qinjie Zheng
Chaoyue Wang
Dadong Wang
Dacheng Tao
VLM
148
3
0
28 May 2022
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Chenliang Li
Haiyang Xu
Junfeng Tian
Wei Wang
Ming Yan
...
Ji Zhang
Songfang Huang
Feiran Huang
Jingren Zhou
Luo Si
VLM, MLLM
201
264
0
24 May 2022
Wireless Ad Hoc Federated Learning: A Fully Distributed Cooperative Machine Learning
H. Ochiai
Yuwei Sun
Qingzhe Jin
Nattanon Wongwiwatchai
Hiroshi Esaki
125
24
0
24 May 2022
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yuan Yao
Qi-An Chen
Ao Zhang
Wei Ji
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
VLM, MLLM
217
43
0
23 May 2022
Training Vision-Language Transformers from Captions
Liangke Gui
Yingshan Chang
Qiuyuan Huang
Subhojit Som
Alexander G. Hauptmann
Jianfeng Gao
Yonatan Bisk
VLM, ViT
342
11
0
19 May 2022
Simple Open-Vocabulary Object Detection with Vision Transformers
Matthias Minderer
A. Gritsenko
Austin Stone
Maxim Neumann
Dirk Weissenborn
...
Zhuoran Shen
Tianlin Li
Xiaohua Zhai
Thomas Kipf
N. Houlsby
ObjD, CLIP, VLM, ViT, OCL
254
362
0
12 May 2022
Weakly-supervised segmentation of referring expressions
Robin Strudel
Ivan Laptev
Cordelia Schmid
209
28
0
10 May 2022
Beyond Bounding Box: Multimodal Knowledge Learning for Object Detection
Wei Feng
Xingyuan Bu
Chenchen Zhang
Xubin Li
VLM
113
5
0
09 May 2022
Declaration-based Prompt Tuning for Visual Question Answering
International Joint Conference on Artificial Intelligence (IJCAI), 2022
Yuhang Liu
Wei Wei
Daowan Peng
Feida Zhu
MLLM, VLM
98
21
0
05 May 2022
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
A. Piergiovanni
Wei Li
Weicheng Kuo
M. Saffar
Fred Bertsch
A. Angelova
207
18
0
02 May 2022
A Multi-level Alignment Training Scheme for Video-and-Language Grounding
Yubo Zhang
Feiyang Niu
Q. Ping
Govind Thattai
CVBM
165
2
0
22 Apr 2022
Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension
IEEE Transactions on Image Processing (IEEE TIP), 2022
Peihan Miao
Wei Su
Gaoang Wang
Xuewei Li
Xi Li
ObjD
246
12
0
21 Apr 2022