
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

IEEE International Conference on Computer Vision (ICCV), 2021

26 April 2021


Papers citing "MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"

Showing 50 of 678 citing papers

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

Navid Rajabi

Jana Kosecka

VLM

275

18 Aug 2023

RLIPv2: Fast Scaling of Relational Language-Image Pre-training

IEEE International Conference on Computer Vision (ICCV), 2023

244

18 Aug 2023

Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer

IEEE International Conference on Computer Vision (ICCV), 2023

298

16 Aug 2023

Helping Hands: An Object-Aware Ego-Centric Video Recognition Model

IEEE International Conference on Computer Vision (ICCV), 2023

222

15 Aug 2023

Taming Self-Training for Open-Vocabulary Object Detection

Computer Vision and Pattern Recognition (CVPR), 2023

364

11 Aug 2023

Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023

210

07 Aug 2023

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

International Conference on Learning Representations (ICLR), 2023

...

Zhiguo Cao

Yu Qiao

270

118

03 Aug 2023

Grounded Image Text Matching with Mismatched Relation Reasoning

IEEE International Conference on Computer Vision (ICCV), 2023

257

02 Aug 2023

Towards General Visual-Linguistic Face Forgery Detection

Computer Vision and Pattern Recognition (CVPR), 2023

265

31 Jul 2023

Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks

204

31 Jul 2023

JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery

IEEE International Conference on Computer Vision (ICCV), 2023

Jiahao Li

Zongxin Yang

Xiaohan Wang

Jianxin Ma

Chang Zhou

Yi Yang

256

31 Jul 2023

UnIVAL: Unified Model for Image, Video, Audio and Language Tasks

308

30 Jul 2023

Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition

Conference on Robot Learning (CoRL), 2023

Huy Ha

Peter R. Florence

Shuran Song

LM&Ro

273

208

26 Jul 2023

Foundational Models Defining a New Era in Vision: A Survey and Outlook

Muhammad Awais

Muzammal Naseer

Salman Khan

Rao Muhammad Anwer

Hisham Cholakkal

430

152

25 Jul 2023

3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Xize Cheng

Zhou Zhao

176

25 Jul 2023

Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation

268

25 Jul 2023

Described Object Detection: Liberating Object Detection with Flexible Expressions

Neural Information Processing Systems (NeurIPS), 2023

242

24 Jul 2023

Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision

Xiangtai Li

275

23 Jul 2023

Advancing Visual Grounding with Scene Knowledge: Benchmark and Method

Computer Vision and Pattern Recognition (CVPR), 2023

Xiang Wan

175

21 Jul 2023

Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation

IEEE International Conference on Computer Vision (ICCV), 2023

Zunnan Xu

Yong Zhang

Xiang Wan

228

21 Jul 2023

Divert More Attention to Vision-Language Object Tracking

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

261

19 Jul 2023

Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding

IEEE International Conference on Computer Vision (ICCV), 2023

Xize Cheng

Zhou Zhao

188

18 Jul 2023

A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Chaoyang Zhu

Long Chen

ObjD VLM

510

18 Jul 2023

Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023

227

17 Jul 2023

BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization

IEEE International Conference on Computer Vision (ICCV), 2023

Fei Huang

190

17 Jul 2023

Bootstrapping Vision-Language Learning with Decoupled Language Pre-training

Neural Information Processing Systems (NeurIPS), 2023

388

13 Jul 2023

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

Conference on Robot Learning (CoRL), 2023

Wenlong Huang

Chen Wang

Ruohan Zhang

Yunzhu Li

Jiajun Wu

Li Fei-Fei

LM&Ro

407

750

12 Jul 2023

GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023

211

12 Jul 2023

Prototypical Contrastive Transfer Learning for Multimodal Language Understanding

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023

Seitaro Otsuki

Shintaro Ishikawa

K. Sugiura

186

12 Jul 2023

GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

913

317

07 Jul 2023

Vision Language Transformers: A Survey

Clayton Fields

C. Kennington

VLM

182

06 Jul 2023

Distilling Large Vision-Language Model with Out-of-Distribution Generalizability

IEEE International Conference on Computer Vision (ICCV), 2023

344

06 Jul 2023

Human Inspired Progressive Alignment and Comparative Learning for Grounded Word Acquisition

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

279

05 Jul 2023

Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners

Conference on Robot Learning (CoRL), 2023

Allen Z. Ren

Anushri Dixit

Alexandra Bodrova

Sumeet Singh

Stephen Tu

...

Dorsa Sadigh

Anirudha Majumdar

487

307

04 Jul 2023

AVSegFormer: Audio-Visual Segmentation with Transformer

AAAI Conference on Artificial Intelligence (AAAI), 2023

372

03 Jul 2023

CoPL: Contextual Prompt Learning for Vision-Language Understanding

AAAI Conference on Artificial Intelligence (AAAI), 2023

Balaji Vasan Srinivasan

VLM

270

03 Jul 2023

Statler: State-Maintaining Language Models for Embodied Reasoning

IEEE International Conference on Robotics and Automation (ICRA), 2023

269

30 Jun 2023

Look, Remember and Reason: Grounded Reasoning in Videos with Language Models

International Conference on Learning Representations (ICLR), 2023

Apratim Bhattacharyya

470

30 Jun 2023

Towards Open Vocabulary Learning: A Survey

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Xiangtai Li

...

Jiangning Zhang

406

218

28 Jun 2023

REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction

Conference on Robot Learning (CoRL), 2023

Zeyi Liu

Arpit Bahety

Shuran Song

LRM

459

189

27 Jun 2023

Kosmos-2: Grounding Multimodal Large Language Models to the World

International Conference on Learning Representations (ICLR), 2023

396

1,026

26 Jun 2023

Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input

European Conference on Computer Vision (ECCV), 2023

103

25 Jun 2023

DesCo: Learning Object Recognition with Rich Language Descriptions

Neural Information Processing Systems (NeurIPS), 2023

185

24 Jun 2023

RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing

IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2023

1.2K

153

20 Jun 2023

Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding

IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2023

Zengjie Song

Zhaoxiang Zhang

169

19 Jun 2023

CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Xiwen Liang

Liang Ma

Shanshan Guo

Jianhua Han

Hang Xu

Shikui Ma

Xiaodan Liang

LM&Ro LLMAG

367

17 Jun 2023

Scaling Open-Vocabulary Object Detection

Neural Information Processing Systems (NeurIPS), 2023

418

309

16 Jun 2023

Recurrent Action Transformer with Memory

393

15 Jun 2023

Exploring the Application of Large-scale Pre-trained Models on Adverse Weather Removal

IEEE Transactions on Image Processing (IEEE TIP), 2023

230

15 Jun 2023

World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

196

14 Jun 2023