2104.12763
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
26 April 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD
VLM
Papers citing
"MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"
50 / 607 papers shown
Title
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
Wen Wang
Zhe Chen
Xiaokang Chen
Jiannan Wu
Xizhou Zhu
...
Ping Luo
Tong Lu
Jie Zhou
Yu Qiao
Jifeng Dai
MLLM
VLM
33
454
0
18 May 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang
Shijie Wang
Junyang Lin
Shuai Bai
Xiaohuan Zhou
Jingren Zhou
Xinggang Wang
Chang Zhou
VLM
MLLM
ObjD
16
114
0
18 May 2023
Visual Question Answering: A Survey on Techniques and Common Trends in Recent Literature
Ana Claudia Akemi Matsuki de Faria
Felype de Castro Bastos
Jose Victor Nogueira Alves da Silva
Vitor Lopes Fabris
Valeska Uchôa
Décio Gonçalves de Aguiar Neto
C. F. G. Santos
25
22
0
18 May 2023
Annotation-free Audio-Visual Segmentation
Jinxian Liu
Yu Wang
Chen Ju
Chaofan Ma
Ya-Qin Zhang
Weidi Xie
VOS
VLM
29
28
0
18 May 2023
Weakly-Supervised Visual-Textual Grounding with Semantic Prior Refinement
Davide Rigoni
Luca Parolari
Luciano Serafini
A. Sperduti
Lamberto Ballan
21
1
0
18 May 2023
UniS-MMC: Multimodal Classification via Unimodality-supervised Multimodal Contrastive Learning
Heqing Zou
Meng Shen
Chen Chen
Yuchen Hu
D. Rajan
Chng Eng Siong
SSL
32
15
0
16 May 2023
CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding
Linhui Xiao
Xiaoshan Yang
Fang Peng
Ming Yan
Yaowei Wang
Changsheng Xu
ObjD
VLM
29
30
0
15 May 2023
COLA: A Benchmark for Compositional Text-to-image Retrieval
Arijit Ray
Filip Radenovic
Abhimanyu Dubey
Bryan A. Plummer
Ranjay Krishna
Kate Saenko
CoGe
VLM
38
34
0
05 May 2023
Unified Model Learning for Various Neural Machine Translation
Yunlong Liang
Fandong Meng
Jinan Xu
Jiaan Wang
Yufeng Chen
Jie Zhou
29
1
0
04 May 2023
Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement
N. Gkanatsios
Ayush Jain
Zhou Xian
Yunchu Zhang
C. Atkeson
Katerina Fragkiadaki
LM&Ro
98
31
0
27 Apr 2023
π-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation
Chengyue Wu
Teng Wang
Yixiao Ge
Zeyu Lu
Rui-Zhi Zhou
Ying Shan
Ping Luo
MoMe
80
35
0
27 Apr 2023
Zero-shot Unsupervised Transfer Instance Segmentation
Gyungin Shin
Samuel Albanie
Weidi Xie
ISeg
VLM
62
5
0
27 Apr 2023
Multimodal Grounding for Embodied AI via Augmented Reality Headsets for Natural Language Driven Task Planning
Selma Wanna
Fabian Parra
R. Valner
Karl Kruusamäe
Mitch Pryor
LM&Ro
22
2
0
26 Apr 2023
A Cookbook of Self-Supervised Learning
Randall Balestriero
Mark Ibrahim
Vlad Sobal
Ari S. Morcos
Shashank Shekhar
...
Pierre Fernandez
Amir Bar
Hamed Pirsiavash
Yann LeCun
Micah Goldblum
SyDa
FedML
SSL
31
272
0
24 Apr 2023
OmniLabel: A Challenging Benchmark for Language-Based Object Detection
S. Schulter
G. VijayKumarB.
Yumin Suh
Konstantinos M. Dafnis
Zhixing Zhang
Shiyu Zhao
Dimitris N. Metaxas
ObjD
22
11
0
22 Apr 2023
Transformer-Based Visual Segmentation: A Survey
Xiangtai Li
Henghui Ding
Haobo Yuan
Wenwei Zhang
Jiangmiao Pang
Guangliang Cheng
Kai-xiang Chen
Ziwei Liu
Chen Change Loy
ViT
MedIm
37
132
0
19 Apr 2023
Delving into Shape-aware Zero-shot Semantic Segmentation
Xinyu Liu
Beiwen Tian
Zhen Wang
Rui Wang
Kehua Sheng
Bo-Wen Zhang
Hao Zhao
Guyue Zhou
VLM
14
20
0
17 Apr 2023
On the Opportunities and Challenges of Foundation Models for Geospatial Artificial Intelligence
Gengchen Mai
Weiming Huang
Jin Sun
Suhang Song
Deepak Mishra
...
Yingjie Hu
Chris Cundy
Ziyuan Li
Rui Zhu
Ni Lao
AI4CE
22
118
0
13 Apr 2023
What does CLIP know about a red circle? Visual prompt engineering for VLMs
Aleksandar Shtedritski
Christian Rupprecht
Andrea Vedaldi
VLM
MLLM
24
140
0
13 Apr 2023
Verbs in Action: Improving verb understanding in video-language models
Liliane Momeni
Mathilde Caron
Arsha Nagrani
Andrew Zisserman
Cordelia Schmid
30
70
0
13 Apr 2023
WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language
Zhe-nan Lin
Xidong Peng
Peishan Cong
Ge Zheng
Yujin Sun
Yuenan Hou
Xinge Zhu
Sibei Yang
Yuexin Ma
VGen
82
4
0
12 Apr 2023
MoMo: A shared encoder Model for text, image and multi-Modal representations
Rakesh Chada
Zhao-Heng Zheng
P. Natarajan
ViT
19
4
0
11 Apr 2023
Detection Transformer with Stable Matching
Siyi Liu
Tianhe Ren
Jia-Yu Chen
Zhaoyang Zeng
Hao Zhang
...
Hongyang Li
Jun Huang
Hang Su
Jun Zhu
Lei Zhang
25
34
0
10 Apr 2023
DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment
Lewei Yao
Jianhua Han
Xiaodan Liang
Danqian Xu
Wei Zhang
Zhenguo Li
Hang Xu
VLM
ObjD
CLIP
37
72
0
10 Apr 2023
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language
Shentong Mo
Jingfei Xia
Ihor Markevych
CLIP
VLM
16
1
0
10 Apr 2023
ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes
Ran Gong
Jiangyong Huang
Yizhou Zhao
Haoran Geng
Xiaofeng Gao
...
Ziheng Zhou
D. Terzopoulos
Song-Chun Zhu
Baoxiong Jia
Siyuan Huang
LM&Ro
37
45
0
09 Apr 2023
Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning
Yu Yang
Besmira Nushi
Hamid Palangi
Baharan Mirzasoleiman
26
36
0
08 Apr 2023
V3Det: Vast Vocabulary Visual Detection Dataset
Jiaqi Wang
Pan Zhang
Tao Chu
Yuhang Cao
Yujie Zhou
Tong Wu
Bin Wang
Conghui He
Dahua Lin
VLM
ObjD
18
51
0
07 Apr 2023
Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-commerce
Yang Jin
Yongzhi Li
Zehuan Yuan
Yadong Mu
22
13
0
06 Apr 2023
Learning to Name Classes for Vision and Language Models
Sarah Parisot
Yongxin Yang
Steven G. McDonagh
VLM
17
10
0
04 Apr 2023
Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA
Yongxin Zhu
Z. Liu
Yukang Liang
Xin Li
Hao Liu
Changcun Bao
Linli Xu
16
6
0
04 Apr 2023
Probabilistic Prompt Learning for Dense Prediction
Hyeongjun Kwon
Taeyong Song
Somi Jeong
Jin-Hwa Kim
Jinhyun Jang
K. Sohn
VLM
19
18
0
03 Apr 2023
Vision-Language Models for Vision Tasks: A Survey
Jingyi Zhang
Jiaxing Huang
Sheng Jin
Shijian Lu
VLM
39
479
0
03 Apr 2023
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Brian Chen
Nina Shvetsova
Andrew Rouditchenko
D. Kondermann
Samuel Thomas
Shih-Fu Chang
Rogerio Feris
James R. Glass
Hilde Kuehne
27
7
0
29 Mar 2023
ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance
Zoey Guo
Yiwen Tang
Renrui Zhang
Dong Wang
Zhigang Wang
Bin Zhao
Xuelong Li
33
53
0
29 Mar 2023
Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention
Sounak Mondal
Zhibo Yang
Seoyoung Ahn
Dimitris Samaras
G. Zelinsky
Minh Hoai
17
29
0
27 Mar 2023
Query-Dependent Video Representation for Moment Retrieval and Highlight Detection
WonJun Moon
Sangeek Hyun
S. Park
Dongchan Park
Jae-Pil Heo
ViT
41
106
0
24 Mar 2023
Open-Vocabulary Object Detection using Pseudo Caption Labels
Han-Cheol Cho
Won Young Jhoo
Woohyun Kang
Byungseok Roh
VLM
ObjD
14
20
0
23 Mar 2023
LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation
K. Pnvr
Bharat Singh
P. Ghosh
Behjat Siddiquie
David Jacobs
DiffM
22
29
0
22 Mar 2023
Detecting the open-world objects with the help of the Brain
Shuailei Ma
Yuefeng Wang
Ying-yu Wei
Peihao Chen
Zhixiang Ye
Jiaqi Fan
Enming Zhang
Thomas H. Li
VLM
ObjD
16
2
0
21 Mar 2023
A Region-Prompted Adapter Tuning for Visual Abductive Reasoning
Hao Zhang
Yeo Keat Ee
Basura Fernando
VLM
27
3
0
18 Mar 2023
Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection
Kyle Buettner
Adriana Kovashka
20
0
0
17 Mar 2023
A Simple Framework for Open-Vocabulary Segmentation and Detection
Hao Zhang
Feng Li
Xueyan Zou
Siyi Liu
Chun-yue Li
Jianfeng Gao
Jianwei Yang
Lei Zhang
ObjD
VLM
15
149
0
14 Mar 2023
Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment
Zhihao Chen
Yangqiaoyu Zhou
A. Tran
Junting Zhao
Liang Wan
...
Lionel T. E. Cheng
C. Thng
Xinxing Xu
Yong-Jin Liu
H. Fu
MedIm
28
21
0
14 Mar 2023
Audio Visual Language Maps for Robot Navigation
Chen Huang
Oier Mees
Andy Zeng
Wolfram Burgard
VGen
60
32
0
13 Mar 2023
Universal Instance Perception as Object Discovery and Retrieval
B. Yan
Yi-Xin Jiang
Jiannan Wu
D. Wang
Ping Luo
Zehuan Yuan
Huchuan Lu
VOS
VLM
LRM
27
161
0
12 Mar 2023
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos
Teng Wang
Jinrui Zhang
Feng Zheng
Wenhao Jiang
Ran Cheng
Ping Luo
VLM
26
11
0
11 Mar 2023
Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation
Zhao Yang
Jiaqi Wang
Yansong Tang
Kai-xiang Chen
Hengshuang Zhao
Philip H. S. Torr
31
23
0
11 Mar 2023
Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection
Luting Wang
Yi Liu
Penghui Du
Zihan Ding
Yue Liao
Qiaosong Qi
Biaolong Chen
Si Liu
ObjD
VLM
68
62
0
10 Mar 2023
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
Shilong Liu
Zhaoyang Zeng
Tianhe Ren
Feng Li
Hao Zhang
...
Chun-yue Li
Jianwei Yang
Hang Su
Jun Zhu
Lei Zhang
ObjD
49
1,804
0
09 Mar 2023