MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

IEEE International Conference on Computer Vision (ICCV), 2021
26 April 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD, VLM
ArXiv (abs) | PDF | HTML | GitHub (1008★)

Papers citing "MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"

50 / 678 papers shown
UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding
European Conference on Computer Vision (ECCV), 2024
Bowen Shi
Peisen Zhao
Zichen Wang
Yuhang Zhang
Yaoming Wang
...
Wenrui Dai
Junni Zou
Hongkai Xiong
Qi Tian
Xiaopeng Zhang
VLM
192
12
0
12 Jan 2024
GroundingGPT: Language Enhanced Multi-modal Grounding Model
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Zhaowei Li
Qi Xu
Dong Zhang
Hang Song
Yiqing Cai
...
Junting Pan
Zefeng Li
Van Tu Vu
Zhida Huang
Tao Wang
616
95
0
11 Jan 2024
An Open and Comprehensive Pipeline for Unified Object Grounding and Detection
Xiangyu Zhao
Yicheng Chen
Shilin Xu
Xiangtai Li
Xinjiang Wang
Yining Li
Haian Huang
ObjD, AI4CE
308
54
0
04 Jan 2024
Context-Guided Spatio-Temporal Video Grounding
Computer Vision and Pattern Recognition (CVPR), 2024
Xin Gu
Hengrui Fan
Yan Huang
Tiejian Luo
Libo Zhang
272
39
0
03 Jan 2024
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
Neural Information Processing Systems (NeurIPS), 2024
Ziyi Bai
Ruiping Wang
Xilin Chen
350
12
0
03 Jan 2024
Generating Enhanced Negatives for Training Language-Based Object Detectors
Computer Vision and Pattern Recognition (CVPR), 2023
Shiyu Zhao
Long Zhao
Vijay Kumar B.G
Yumin Suh
Dimitris N. Metaxas
Manmohan Chandraker
S. Schulter
ObjD, VLM
449
13
0
29 Dec 2023
Bridging Modality Gap for Visual Grounding with Effecitve Cross-modal Distillation
Chinese Conference on Pattern Recognition and Computer Vision (CPRCV), 2023
Jiaxi Wang
Wenhui Hu
Xueyang Liu
Beihu Wu
Yuting Qiu
Yingying Cai
280
1
0
29 Dec 2023
Set Prediction Guided by Semantic Concepts for Diverse Video Captioning
Yifan Lu
Ziqi Zhang
Chunfen Yuan
Peng Li
Yan Wang
Bing Li
Weiming Hu
161
6
0
25 Dec 2023
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
Jiannan Wu
Yi Jiang
Bin Yan
Huchuan Lu
Zehuan Yuan
Ping Luo
VOS
273
26
0
25 Dec 2023
Cycle-Consistency Learning for Captioning and Grounding
Ning Wang
Jiajun Deng
Mingbo Jia
ObjD
235
13
0
23 Dec 2023
Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training
Xinyan Chen
Jiaxin Ge
Tianjun Zhang
Jiaming Liu
Shanghang Zhang
VLM, EGVM
455
2
0
23 Dec 2023
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection
Haozhan Shen
Tiancheng Zhao
Mingwei Zhu
Yuxiang Cai
VLM, ObjD
413
25
0
22 Dec 2023
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
Penghao Wu
Saining Xie
LRM
405
324
0
21 Dec 2023
A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties
Junfei Xiao
Ziqi Zhou
Wenxuan Li
Shiyi Lan
Jieru Mei
Zhiding Yu
Yaoyao Liu
Yuyin Zhou
Cihang Xie
VLM
187
2
0
21 Dec 2023
Perception Test 2023: A Summary of the First Challenge And Outcome
Joseph Heyward
João Carreira
Dima Damen
Andrew Zisserman
Viorica Patraucean
236
0
0
20 Dec 2023
Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation
Wenhao Xu
Rongtao Xu
Changwei Wang
Shibiao Xu
Li Guo
Man Zhang
Xiaopeng Zhang
VLM
247
18
0
20 Dec 2023
Weakly Supervised Open-Vocabulary Object Detection
Jianghang Lin
Chunjiang Ge
Bingquan Wang
Shaohui Lin
Ke Li
Liujuan Cao
WSOD
302
16
0
19 Dec 2023
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLM, MLLM
386
50
0
19 Dec 2023
Context Disentangling and Prototype Inheriting for Robust Visual Grounding
Wei Tang
Liang Li
Xuejing Liu
Lu Jin
Jinhui Tang
Zechao Li
271
41
0
19 Dec 2023
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation
Sihan Liu
Yiwei Ma
Xiaoqing Zhang
Haowei Wang
Jiayi Ji
Xiaoshuai Sun
Rongrong Ji
413
87
0
19 Dec 2023
Text-Conditioned Resampler For Long Form Video Understanding
Bruno Korbar
Yongqin Xian
A. Tonioni
Andrew Zisserman
Federico Tombari
305
23
0
19 Dec 2023
Pixel Aligned Language Models
Computer Vision and Pattern Recognition (CVPR), 2023
Jiarui Xu
Xingyi Zhou
Shen Yan
Xiuye Gu
Anurag Arnab
Chen Sun
Xiaolong Wang
Cordelia Schmid
MLLM, VLM
279
17
0
14 Dec 2023
General Object Foundation Model for Images and Videos at Scale
Computer Vision and Pattern Recognition (CVPR), 2023
Junfeng Wu
Yi Jiang
Qihao Liu
Zehuan Yuan
Xiang Bai
Song Bai
VOS, VLM
339
79
0
14 Dec 2023
Exploration of visual prompt in Grounded pre-trained open-set detection
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Qibo Chen
Weizhong Jin
Shuchang Li
Mengdi Liu
Li Yu
Jian Jiang
Xiaozheng Wang
VLM
115
1
0
14 Dec 2023
Segment Beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation
AAAI Conference on Artificial Intelligence (AAAI), 2023
Renjie Wu
Hu Wang
Feras Dayoub
Hsiang-Ting Chen
202
10
0
14 Dec 2023
SKDF: A Simple Knowledge Distillation Framework for Distilling Open-Vocabulary Knowledge to Open-world Object Detector
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Shuailei Ma
Yuefeng Wang
Ying-yu Wei
Jiaqi Fan
Enming Zhang
Xinyu Sun
Peihao Chen
ObjD
306
3
0
14 Dec 2023
EZ-CLIP: Efficient Zeroshot Video Action Recognition
Shahzad Ahmad
S. Chanda
Yogesh S Rawat
VLM
270
11
0
13 Dec 2023
CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor
Computer Vision and Pattern Recognition (CVPR), 2023
Shuyang Sun
Runjia Li
Juil Sock
Xiuye Gu
Siyang Li
VLM, CLIP
468
56
0
12 Dec 2023
Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator
Henry Hengyuan Zhao
Pan Zhou
Mike Zheng Shou
MLLM, SyDa
451
11
0
11 Dec 2023
Visual Grounding of Whole Radiology Reports for 3D CT Images
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2023
Akimichi Ichinose
Taro Hatsutani
Keigo Nakamura
Yoshiro Kitamura
S. Iizuka
E. Simo-Serra
Shoji Kido
Noriyuki Tomiyama
225
12
0
08 Dec 2023
Improved Visual Grounding through Self-Consistent Explanations
Ruozhen He
Paola Cascante-Bonilla
Ziyan Yang
Alexander C. Berg
Vicente Ordonez
ReLM, ObjD, LRM, FAtt
275
24
0
07 Dec 2023
GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models
Haicheng Liao
Huanming Shen
Zhenning Li
Chengyue Wang
Guofa Li
Yiming Bie
Chengzhong Xu
238
80
0
06 Dec 2023
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
Hao Zhang
Hongyang Li
Feng Li
Tianhe Ren
Xueyan Zou
...
Shijia Huang
Jianfeng Gao
Lei Zhang
Chun-yue Li
Jianwei Yang
343
112
0
05 Dec 2023
Lenna: Language Enhanced Reasoning Detection Assistant
Fei Wei
Xinyu Zhang
Ailing Zhang
Bo Zhang
Xiangxiang Chu
MLLM, LRM
267
32
0
05 Dec 2023
Aligning and Prompting Everything All at Once for Universal Visual Perception
Computer Vision and Pattern Recognition (CVPR), 2023
Chunjiang Ge
Chaoyou Fu
Peixian Chen
Mengdan Zhang
Ke Li
Xing Sun
Yunsheng Wu
Shaohui Lin
Rongrong Ji
VLM, ObjD
287
64
0
04 Dec 2023
Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence
International Conference on Information Photonics (ICIP), 2023
Yajie Liu
Pu Ge
Haoxiang Ma
Shichao Fan
Qingjie Liu
Di Huang
Yunhong Wang
198
1
0
01 Dec 2023
InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation
Rongyao Fang
Shilin Yan
Zhaoyang Huang
Jingqiu Zhou
Hao Tian
Jifeng Dai
Jiaming Song
MLLM
208
16
0
30 Nov 2023
Language-conditioned Detection Transformer
Computer Vision and Pattern Recognition (CVPR), 2023
Jang Hyun Cho
Philipp Krahenbuhl
VLM, ObjD
187
5
0
29 Nov 2023
The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding
Computer Vision and Pattern Recognition (CVPR), 2023
Lorenzo Bianchi
F. Carrara
Nicola Messina
Claudio Gennaro
Fabrizio Falchi
ObjD
349
24
0
29 Nov 2023
No Representation Rules Them All in Category Discovery
Neural Information Processing Systems (NeurIPS), 2023
S. Vaze
Andrea Vedaldi
Andrew Zisserman
OOD
253
55
0
28 Nov 2023
Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
Computer Vision and Pattern Recognition (CVPR), 2023
Zeyu Han
Fangrui Zhu
Qianru Lao
Huaizu Jiang
ObjD
411
19
0
28 Nov 2023
Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models
European Conference on Computer Vision (ECCV), 2023
Yufei Zhan
Yousong Zhu
Zhiyang Chen
Fan Yang
E. Goles
Jinqiao Wang
ObjD
242
30
0
24 Nov 2023
Visual In-Context Prompting
Computer Vision and Pattern Recognition (CVPR), 2023
Feng Li
Qing Jiang
Hao Zhang
Tianhe Ren
Shilong Liu
...
Hongyang Li
Chun-yue Li
Jianwei Yang
Lei Zhang
Jianfeng Gao
VLM, LRM, MLLM
187
51
0
22 Nov 2023
Enhancing Visual Grounding and Generalization: A Multi-Task Cycle Training Approach for Vision-Language Models
Xiaoyu Yang
Lijian Xu
Hao Sun
Jiaming Song
Shaoting Zhang
ObjD
432
11
0
21 Nov 2023
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
Junke Wang
Lingchen Meng
Zejia Weng
Bo He
Zuxuan Wu
Yu-Gang Jiang
MLLM, VLM
268
135
0
13 Nov 2023
PerceptionGPT: Effectively Fusing Visual Perception into LLM
Computer Vision and Pattern Recognition (CVPR), 2023
Renjie Pi
Lewei Yao
Jiahui Gao
Jipeng Zhang
Tong Zhang
MLLM
194
56
0
11 Nov 2023
Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter
Conference on Robot Learning (CoRL), 2023
Georgios Tziafas
Yucheng Xu
Arushi Goel
Mohammadreza Kasaei
Zhibin Li
Hamidreza Kasaei
240
40
0
09 Nov 2023
DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation
Guinan Su
Yanwu Yang
Zhifeng Li
VGen
235
3
0
08 Nov 2023
GLaMM: Pixel Grounding Large Multimodal Model
Computer Vision and Pattern Recognition (CVPR), 2023
H. Rasheed
Muhammad Maaz
Sahal Shaji Mullappilly
Abdelrahman M. Shaker
Salman Khan
Hisham Cholakkal
Rao M. Anwer
Eric Xing
Ming-Hsuan Yang
Fahad S. Khan
MLLM, VLM
434
396
0
06 Nov 2023
Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Jingru Yi
Burak Uzkent
Oana Ignat
Zili Li
Amanmeet Garg
Xiang Yu
Linda Liu
VLM
283
2
0
05 Nov 2023