MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

IEEE International Conference on Computer Vision (ICCV), 2021

26 April 2021

Papers citing "MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"

50 / 678 papers shown

Proposal-Level Unsupervised Domain Adaptation for Open World Unbiased Detector

Xuanyi Liu

Zhongqi Yue

Xian-Sheng Hua

309

04 Nov 2023

Recognize Any Regions

Neural Information Processing Systems (NeurIPS), 2023

361

02 Nov 2023

Spuriosity Rankings for Free: A Simple Framework for Last Layer Retraining Based on Object Detection

Mohammad Azizmalayeri

Reza Abbasi

Amir Hosein Haji Mohammad Rezaie

172

31 Oct 2023

A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis

medRxiv (medRxiv), 2023

Yingshu Li

Yunyi Liu

Zhanyu Wang

Xinyu Liang

Lei Wang

Lingqiao Liu

Leyang Cui

329

31 Oct 2023

ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese

155

27 Oct 2023

3D-Aware Visual Question Answering about Parts, Poses and Occlusions

Neural Information Processing Systems (NeurIPS), 2023

318

27 Oct 2023

RIO: A Benchmark for Reasoning Intention-Oriented Objects in Open Environments

Neural Information Processing Systems (NeurIPS), 2023

Jingkuan Song

239

26 Oct 2023

Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network

Industrial Conference on Data Mining (IDM), 2023

Yiming Lin

Xiao-Bo Jin

Qiufeng Wang

Kaizhu Huang

155

25 Oct 2023

Video Referring Expression Comprehension via Transformer with Content-conditioned Query

267

25 Oct 2023

What's Left? Concept Grounding with Logic-Enhanced Foundation Models

Neural Information Processing Systems (NeurIPS), 2023

Joy Hsu

Jiayuan Mao

Joshua B. Tenenbaum

Jiajun Wu

VLM ReLM LRM

384

24 Oct 2023

Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive Survey and Evaluation

Peng Wang

263

24 Oct 2023

OV-VG: A Benchmark for Open-Vocabulary Visual Grounding

Xiangtai Li

269

22 Oct 2023

LanPose: Language-Instructed 6D Object Pose Estimation for Robotic Assembly

281

20 Oct 2023

Multiscale Superpixel Structured Difference Graph Convolutional Network for VL Representation

299

20 Oct 2023

Weakly-Supervised Semantic Segmentation with Image-Level Labels: from Traditional Models to Foundation Models

ACM Computing Surveys (ACM Comput. Surv.), 2023

Zhaozheng Chen

Qianru Sun

VLM

426

19 Oct 2023

Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection

Neural Information Processing Systems (NeurIPS), 2023

Jianwei Yang

Zuxuan Wu

Lu Yuan

Yu-Gang Jiang

154

18 Oct 2023

InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions

192

18 Oct 2023

NICE: Improving Panoptic Narrative Detection and Segmentation with Cascading Collaborative Learning

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Jiayi Ji

349

17 Oct 2023

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

International Conference on Learning Representations (ICLR), 2023

388

235

16 Oct 2023

Ferret: Refer and Ground Anything Anywhere at Any Granularity

International Conference on Learning Representations (ICLR), 2023

Xianzhi Du

415

453

11 Oct 2023

CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding

International Conference on Learning Representations (ICLR), 2023

381

10 Oct 2023

InstructDET: Diversifying Referring Object Detection with Generalized Instructions

International Conference on Learning Representations (ICLR), 2023

...

434

08 Oct 2023

Lightweight In-Context Tuning for Multimodal Unified Models

144

08 Oct 2023

Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

358

05 Oct 2023

CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection

Neural Information Processing Systems (NeurIPS), 2023

Yang Cao

Yihan Zeng

Hang Xu

Dan Xu

3DPC ObjD

243

04 Oct 2023

Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving

IEEE International Conference on Computer Vision (ICCV), 2023

240

25 Sep 2023

CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation

ACM Multimedia (ACM MM), 2023

Lei Chen

252

18 Sep 2023

PRE: Vision-Language Prompt Learning with Reparameterization Encoder

Anh Pham Thi Minh

An Duc Nguyen

Georgios Tzimiropoulos

VPVLM VLM

236

14 Sep 2023

Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation

215

12 Sep 2023

Multi3DRefer: Grounding Text Description to Multiple 3D Objects

IEEE International Conference on Computer Vision (ICCV), 2023

Yiming Zhang

ZeMing Gong

Angel X. Chang

394

134

11 Sep 2023

Language Prompt for Autonomous Driving

AAAI Conference on Artificial Intelligence (AAAI), 2023

Cheng-zhong Xu

474

127

08 Sep 2023

Box-based Refinement for Weakly Supervised and Unsupervised Localization Tasks

IEEE International Conference on Computer Vision (ICCV), 2023

Eyal Gomel

Tal Shaharabany

Lior Wolf

ObjD

350

07 Sep 2023

DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners

IEEE International Conference on Computer Vision (ICCV), 2023

Clarence Lee

M Ganesh Kumar

Cheston Tan

198

07 Sep 2023

A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models

Noriyuki Kojima

Hadar Averbuch-Elor

Yoav Artzi

317

06 Sep 2023

Dense Object Grounding in 3D Scenes

ACM Multimedia (ACM MM), 2023

Wencan Huang

Daizong Liu

Wei Hu

259

05 Sep 2023

CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection

IEEE International Conference on Computer Vision (ICCV), 2023

Jingyi Yu

215

03 Sep 2023

Catalog Phrase Grounding (CPG): Grounding of Product Textual Attributes in Product Images for e-commerce Vision-Language Applications

Wenyi Wu

Karim Bouyarmane

Ismail B. Tutar

30 Aug 2023

GREC: Generalized Referring Expression Comprehension

257

30 Aug 2023

Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection

IEEE Transactions on Image Processing (IEEE TIP), 2023

215

30 Aug 2023

Shatter and Gather: Learning Referring Image Segmentation with Text Supervision

IEEE International Conference on Computer Vision (ICCV), 2023

278

29 Aug 2023

UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory

Computer Vision and Pattern Recognition (CVPR), 2023

Huchuan Lu

241

28 Aug 2023

Towards Unified Token Learning for Vision-Language Tracking

Guorong Li

270

27 Aug 2023

Beyond One-to-One: Rethinking the Referring Image Segmentation

IEEE International Conference on Computer Vision (ICCV), 2023

Jungong Han

Ping Luo

3DV

244

26 Aug 2023

Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models

Peng Li

Maosong Sun

Yang Liu

MLLM VLM

301

25 Aug 2023

How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection

AAAI Conference on Artificial Intelligence (AAAI), 2023

197

25 Aug 2023

SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023

253

24 Aug 2023

Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation

IEEE International Conference on Computer Vision (ICCV), 2023

223

24 Aug 2023

HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks

Xin Zhan

24 Aug 2023

RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D

IEEE International Conference on Computer Vision (ICCV), 2023

257

23 Aug 2023

Deep Metric Loss for Multimodal Learning

Machine-mediated learning (ML), 2023

Sehwan Moon

Hyun-Yong Lee

179

21 Aug 2023