MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

IEEE International Conference on Computer Vision (ICCV), 2021
26 April 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD, VLM
arXiv:2104.12763 · GitHub (1008★)

Papers citing "MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"

50 / 678 papers shown
Weakly-supervised segmentation of referring expressions
Robin Strudel
Ivan Laptev
Cordelia Schmid
234
29
0
10 May 2022
Beyond Bounding Box: Multimodal Knowledge Learning for Object Detection
Wei Feng
Xingyuan Bu
Chenchen Zhang
Xubin Li
VLM
153
5
0
09 May 2022
Declaration-based Prompt Tuning for Visual Question Answering
International Joint Conference on Artificial Intelligence (IJCAI), 2022
Yuhang Liu
Wei Wei
Daowan Peng
Feida Zhu
MLLM, VLM
118
21
0
05 May 2022
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
A. Piergiovanni
Wei Li
Weicheng Kuo
M. Saffar
Fred Bertsch
A. Angelova
279
18
0
02 May 2022
A Multi-level Alignment Training Scheme for Video-and-Language Grounding
Yubo Zhang
Feiyang Niu
Q. Ping
Govind Thattai
CVBM
219
2
0
22 Apr 2022
Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension
IEEE Transactions on Image Processing (IEEE TIP), 2022
Peihan Miao
Wei Su
Gaoang Wang
Xuewei Li
Xi Li
ObjD
334
13
0
21 Apr 2022
A Survivor in the Era of Large-Scale Pretraining: An Empirical Study of One-Stage Referring Expression Comprehension
IEEE Transactions on Multimedia (IEEE TMM), 2022
Gen Luo
Weihao Ye
Jiamu Sun
Xiaoshuai Sun
Rongrong Ji
ObjD
243
13
0
17 Apr 2022
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Sanjay Subramanian
William Merrill
Trevor Darrell
Matt Gardner
Sameer Singh
Anna Rohrbach
ObjD
284
156
0
12 Apr 2022
X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks
European Conference on Computer Vision (ECCV), 2022
Zhaowei Cai
Gukyeong Kwon
Avinash Ravichandran
Erhan Bas
Zhuowen Tu
Rahul Bhotika
Stefano Soatto
ObjD, MLLM, VLM
145
51
0
12 Apr 2022
Domain-Agnostic Prior for Transfer Semantic Segmentation
Computer Vision and Pattern Recognition (CVPR), 2022
Xinyue Huo
Lingxi Xie
Hengtong Hu
Wen-gang Zhou
Houqiang Li
Qi Tian
220
38
0
06 Apr 2022
"This is my unicorn, Fluffy": Personalizing frozen vision-language
  representations
European Conference on Computer Vision (ECCV), 2022
Niv Cohen
Rinon Gal
E. Meirom
Gal Chechik
Yuval Atzmon
VLM, MLLM
351
104
0
04 Apr 2022
MultiMAE: Multi-modal Multi-task Masked Autoencoders
European Conference on Computer Vision (ECCV), 2022
Roman Bachmann
David Mizrahi
Andrei Atanov
Amir Zamir
423
345
0
04 Apr 2022
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
International Conference on Learning Representations (ICLR), 2022
Andy Zeng
Maria Attarian
Brian Ichter
K. Choromanski
Adrian S. Wong
...
Michael S. Ryoo
Vikas Sindhwani
Johnny Lee
Vincent Vanhoucke
Peter R. Florence
ReLM, LRM
590
681
0
01 Apr 2022
FindIt: Generalized Localization with Natural Language Queries
European Conference on Computer Vision (ECCV), 2022
Weicheng Kuo
Fred Bertsch
Wei Li
A. Piergiovanni
M. Saffar
A. Angelova
ObjD
210
18
0
31 Mar 2022
ReSTR: Convolution-free Referring Image Segmentation Using Transformers
Computer Vision and Pattern Recognition (CVPR), 2022
N. Kim
Dongwon Kim
Cuiling Lan
Wenjun Zeng
Suha Kwak
345
178
0
31 Mar 2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers
Computer Vision and Pattern Recognition (CVPR), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
341
121
0
30 Mar 2022
Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding
Computer Vision and Pattern Recognition (CVPR), 2022
Jiabo Ye
Junfeng Tian
Ming Yan
Xiaoshan Yang
Xuwu Wang
Ji Zhang
Liang He
Xin Lin
ObjD
230
93
0
29 Mar 2022
Open-Vocabulary DETR with Conditional Matching
European Conference on Computer Vision (ECCV), 2022
Yuhang Zang
Wei Li
Kaiyang Zhou
Chen Huang
Chen Change Loy
ObjD, VLM
382
262
0
22 Mar 2022
CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation
Computer Vision and Pattern Recognition (CVPR), 2022
S. Gadre
Mitchell Wortsman
Gabriel Ilharco
Ludwig Schmidt
Shuran Song
CLIP, LM&Ro
337
235
0
20 Mar 2022
Local-Global Context Aware Transformer for Language-Guided Video Segmentation
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Chen Liang
Wenguan Wang
Tianfei Zhou
Jiaxu Miao
Yawei Luo
Yi Yang
VOS
322
101
0
18 Mar 2022
End-to-End Modeling via Information Tree for One-Shot Natural Language Spatial Video Grounding
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Meng Li
Tianbao Wang
Haoyu Zhang
Shengyu Zhang
Zhou Zhao
...
Wenming Tan
Jin Wang
Peng Wang
Shi Pu
Leilei Gan
292
46
0
15 Mar 2022
Can you even tell left from right? Presenting a new challenge for VQA
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Sairaam Venkatraman
Rishi Rao
S. Balasubramanian
C. Vorugunti
R. R. Sarma
CoGe
174
0
0
15 Mar 2022
Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking
European Conference on Computer Vision (ECCV), 2022
Boyu Chen
Peixia Li
Mengwei He
Leixian Qiao
Qiuhong Shen
Yue Liu
Weihao Gan
Wei Wu
Wanli Ouyang
ViT, VOT
272
269
0
10 Mar 2022
CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers
Kailai Li
Huayao Liu
Kailun Yang
Xinxin Hu
Ruiping Liu
Rainer Stiefelhagen
ViT
417
513
0
09 Mar 2022
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
International Conference on Learning Representations (ICLR), 2022
Hao Zhang
Feng Li
Shilong Liu
Lei Zhang
Hang Su
Jun Zhu
L. Ni
H. Shum
ViT
744
2,208
0
07 Mar 2022
DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations
AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2022
Yiwei Lyu
Paul Pu Liang
Zihao Deng
Ruslan Salakhutdinov
Louis-Philippe Morency
234
51
0
03 Mar 2022
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models
Feng Li
Hao Zhang
Yi-Fan Zhang
Shixuan Liu
Jian Guo
L. Ni
Pengchuan Zhang
Lei Zhang
AI4TS, VLM
207
41
0
03 Mar 2022
CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding
Computer Vision and Pattern Recognition (CVPR), 2022
Mohamed Afham
Isuru Dissanayake
Dinithi Dissanayake
Amaya Dharmasiri
Kanchana Thilakarathna
Ranga Rodrigo
3DPC
329
318
0
01 Mar 2022
Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment
Computer Vision and Pattern Recognition (CVPR), 2022
Mingyang Zhou
Licheng Yu
Amanpreet Singh
Mengjiao MJ Wang
Zhou Yu
Ning Zhang
VLM
158
35
0
01 Mar 2022
Measuring CLEVRness: Blackbox testing of Visual Reasoning Models
International Conference on Learning Representations (ICLR), 2022
Spyridon Mouselinos
Henryk Michalewski
Mateusz Malinowski
270
4
0
24 Feb 2022
GroupViT: Semantic Segmentation Emerges from Text Supervision
Computer Vision and Pattern Recognition (CVPR), 2022
Jiarui Xu
Shalini De Mello
Sifei Liu
Wonmin Byeon
Thomas Breuel
Jan Kautz
Xinyu Wang
ViT, VLM
759
631
0
22 Feb 2022
VLP: A Survey on Vision-Language Pre-training
Machine Intelligence Research (MIR), 2022
Feilong Chen
Duzhen Zhang
Minglun Han
Xiuyi Chen
Jing Shi
Shuang Xu
Bo Xu
VLM
393
287
0
18 Feb 2022
Delving Deeper into Cross-lingual Visual Question Answering
Findings (Findings), 2022
Chen Cecilia Liu
Jonas Pfeiffer
Anna Korhonen
Ivan Vulić
Iryna Gurevych
300
10
0
15 Feb 2022
An experimental study of the vision-bottleneck in VQA
Social Science Research Network (SSRN), 2022
Pierre Marza
Corentin Kervadec
G. Antipov
M. Baccouche
Christian Wolf
250
1
0
14 Feb 2022
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
International Conference on Machine Learning (ICML), 2022
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM, ObjD
517
1,009
0
07 Feb 2022
Transformers in Medical Imaging: A Survey
Fahad Shamshad
Salman Khan
Syed Waqas Zamir
Muhammad Haris Khan
Munawar Hayat
Fahad Shahbaz Khan
Huazhu Fu
ViT, LM&MA, MedIm
322
958
0
24 Jan 2022
Omnivore: A Single Model for Many Visual Modalities
Computer Vision and Pattern Recognition (CVPR), 2022
Rohit Girdhar
Mannat Singh
Nikhil Ravi
Laurens van der Maaten
Armand Joulin
Ishan Misra
597
287
0
20 Jan 2022
Label-dependent and event-guided interpretable disease risk prediction using EHRs
IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2021
Shuai Niu
Yunya Song
Qing Yin
Wenhan Luo
Xian Yang
106
4
0
18 Jan 2022
Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching
Neurocomputing (Neurocomputing), 2022
Hengcan Shi
Munawar Hayat
Jianfei Cai
ObjD
207
12
0
18 Jan 2022
Multi-Query Video Retrieval
European Conference on Computer Vision (ECCV), 2022
Zeyu Wang
Yu Wu
Karthik Narasimhan
Olga Russakovsky
285
23
0
10 Jan 2022
Language-driven Semantic Segmentation
International Conference on Learning Representations (ICLR), 2022
Boyi Li
Kilian Q. Weinberger
Serge Belongie
V. Koltun
René Ranftl
VLM
329
780
0
10 Jan 2022
Detecting Twenty-thousand Classes using Image-level Supervision
European Conference on Computer Vision (ECCV), 2022
Xingyi Zhou
Rohit Girdhar
Armand Joulin
Phillip Krahenbuhl
Ishan Misra
CLIP, VLM
488
752
0
07 Jan 2022
Language as Queries for Referring Video Object Segmentation
Computer Vision and Pattern Recognition (CVPR), 2022
Jiannan Wu
Yi Jiang
Pei Sun
Zehuan Yuan
Ping Luo
516
220
0
03 Jan 2022
Scaling Open-Vocabulary Image Segmentation with Image-Level Labels
European Conference on Computer Vision (ECCV), 2021
Golnaz Ghiasi
Xiuye Gu
Huayu Chen
Nayeon Lee
VLM
444
494
0
22 Dec 2021
Image Segmentation Using Text and Image Prompts
Computer Vision and Pattern Recognition (CVPR), 2021
Timo Lüddecke
Alexander S. Ecker
CLIP, VLM
710
647
0
18 Dec 2021
Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds
Ayush Jain
N. Gkanatsios
Ishita Mediratta
Katerina Fragkiadaki
ObjD
479
147
0
16 Dec 2021
Predicting Physical World Destinations for Commands Given to Self-Driving Cars
Dusan Grujicic
Thierry Deruyttere
Marie-Francine Moens
Matthew Blaschko
OOD
200
8
0
10 Dec 2021
PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning
Yining Hong
Li Yi
J. Tenenbaum
Antonio Torralba
Chuang Gan
168
43
0
09 Dec 2021
Grounded Language-Image Pre-training
Liunian Harold Li
Pengchuan Zhang
Haotian Zhang
Jianwei Yang
Chunyuan Li
...
Lu Yuan
Lei Zhang
Lei Li
Kai-Wei Chang
Jianfeng Gao
ObjD, VLM
458
1,385
0
07 Dec 2021
From Coarse to Fine-grained Concept based Discrimination for Phrase Detection
Maan Qraitem
Bryan A. Plummer
ObjD
195
0
0
06 Dec 2021