2104.12763
Cited By
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
26 April 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD
VLM
Papers citing
"MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding"
50 / 607 papers shown
Title
Zero-shot Object Navigation with Vision-Language Models Reasoning
Congcong Wen
Yisiyuan Huang
Hao Huang
Yanjia Huang
Shuaihang Yuan
Yu Hao
Hui Lin
Yu-Shen Liu
Yi Fang
LM&Ro
40
7
0
24 Oct 2024
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models
Yufei Zhan
Hongyin Zhao
Yousong Zhu
Fan Yang
Ming Tang
Jinqiao Wang
MLLM
43
1
0
21 Oct 2024
Open-vocabulary vs. Closed-set: Best Practice for Few-shot Object Detection Considering Text Describability
Yusuke Hosoya
Masanori Suganuma
Takayuki Okatani
ObjD
16
0
0
20 Oct 2024
Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation
Changcheng Xiao
Qiong Cao
Yujie Zhong
Xiang Zhang
Tao Wang
Canqun Yang
L. Lan
23
0
0
17 Oct 2024
Context-Infused Visual Grounding for Art
Selina Khan
Nanne van Noord
ObjD
27
1
0
16 Oct 2024
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
Jian Yang
Dacheng Yin
Yizhou Zhou
Fengyun Rao
Wei-dong Zhai
Yang Cao
Zheng-jun Zha
DiffM
28
6
0
14 Oct 2024
DINTR: Tracking via Diffusion-based Interpolation
Pha Nguyen
Ngan Le
J. Cothren
Alper Yilmaz
Khoa Luu
DiffM
38
0
0
14 Oct 2024
DFIMat: Decoupled Flexible Interactive Matting in Multi-Person Scenarios
Siyi Jiao
Wenzheng Zeng
Changxin Gao
Nong Sang
28
1
0
13 Oct 2024
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling
Linhui Xiao
Xiaoshan Yang
Fang Peng
Yaowei Wang
Changsheng Xu
ObjD
24
5
0
10 Oct 2024
G^{2}TR: Generalized Grounded Temporal Reasoning for Robot Instruction Following by Combining Large Pre-trained Models
Riya Arora
N. N.
Aman Tambi
Sandeep S. Zachariah
Souvik Chakraborty
Rohan Paul
LM&Ro
28
0
0
10 Oct 2024
Structured Spatial Reasoning with Open Vocabulary Object Detectors
Negar Nejatishahidin
Madhukar Reddy Vongala
Jana Kosecka
30
2
0
09 Oct 2024
Grounding Partially-Defined Events in Multimodal Data
Kate Sanders
Reno Kriz
David Etter
Hannah Recknor
Alexander Martin
Cameron Carpenter
Jingyang Lin
Benjamin Van Durme
22
2
0
07 Oct 2024
ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models
Mengxue Qu
Xiaodong Chen
Wu Liu
Alicia Li
Yao Zhao
37
13
0
01 Oct 2024
You Only Speak Once to See
Wenhao Yang
Jianguo Wei
Wenhuan Lu
Lei Li
VOS
23
1
0
27 Sep 2024
Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification
Raja Kumar
Raghav Singhal
Pranamya Kulkarni
Deval Mehta
Kshitij Jadhav
13
0
0
26 Sep 2024
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion
Ming Dai
Lingfeng Yang
Yihao Xu
Zhenhua Feng
Wankou Yang
ObjD
27
9
0
26 Sep 2024
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension
Junzhuo Liu
X. Yang
Weiwei Li
Peng Wang
ObjD
44
3
0
23 Sep 2024
Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs
A. Mavrogiannis
Dehao Yuan
Yiannis Aloimonos
LM&Ro
27
0
0
23 Sep 2024
LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension
Amaia Cardiel
Éloi Zablocki
Oriane Siméoni
Elias Ramzi
Matthieu Cord
VLM
23
0
0
18 Sep 2024
Robot Manipulation in Salient Vision through Referring Image Segmentation and Geometric Constraints
Chen Jiang
Allie Luo
Martin Jägersand
15
0
0
17 Sep 2024
Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection
Haoxuan Wang
Q. He
Jinlong Peng
Hao Yang
Mingmin Chi
Yabiao Wang
Mamba
34
1
0
13 Sep 2024
VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation
Hanning Chen
Yang Ni
Wenjun Huang
Yezi Liu
SungHeon Jeong
Fei Wen
Nathaniel Bastian
Hugo Latapie
Mohsen Imani
VLM
32
4
0
13 Sep 2024
An Attribute-Enriched Dataset and Auto-Annotated Pipeline for Open Detection
Pengfei Qi
Yifei Zhang
Wenqiang Li
Youwen Hu
Kunlong Bai
ObjD
20
0
0
10 Sep 2024
Context is the Key: Backdoor Attacks for In-Context Learning with Vision Transformers
Gorka Abad
S. Picek
Lorenzo Cavallaro
A. Urbieta
SILM
37
0
0
06 Sep 2024
Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression
Jingcheng Ke
Dele Wang
Jun-Cheng Chen
I-Hong Jhuo
Chia-Wen Lin
Yen-Yu Lin
31
0
0
05 Sep 2024
More Pictures Say More: Visual Intersection Network for Open Set Object Detection
Bingcheng Dong
Yuning Ding
Jinrong Zhang
Sifan Zhang
Shenglan Liu
ObjD
33
0
0
26 Aug 2024
LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task
Ali Asgarov
Samir Rustamov
VLM
14
1
0
25 Aug 2024
R2G: Reasoning to Ground in 3D Scenes
Yixuan Li
Zan Wang
Wei Liang
41
2
0
24 Aug 2024
D-RMGPT: Robot-assisted collaborative tasks driven by large multimodal models
Matteo Forlini
Mihail Babcinschi
Giacomo Palmieri
Pedro Neto
29
1
0
21 Aug 2024
On the Potential of Open-Vocabulary Models for Object Detection in Unusual Street Scenes
Sadia Ilyas
Ido Freeman
Matthias Rottmann
ObjD
43
3
0
20 Aug 2024
Towards Flexible Visual Relationship Segmentation
Fangrui Zhu
Jianwei Yang
Huaizu Jiang
VOS
29
1
0
15 Aug 2024
An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding
Wei Chen
Mahdieh Hatamian
Yu Wu
37
3
0
02 Aug 2024
Look Hear: Gaze Prediction for Speech-directed Human Attention
Sounak Mondal
Seoyoung Ahn
Zhibo Yang
Niranjan Balasubramanian
Dimitris Samaras
G. Zelinsky
Minh Hoai
34
1
0
28 Jul 2024
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
Junyi Li
Junfeng Wu
Weizhi Zhao
Song Bai
Xiang Bai
31
1
0
23 Jul 2024
HAPFI: History-Aware Planning based on Fused Information
Sujin Jeon
Suyeon Shin
Byoung-Tak Zhang
25
0
0
23 Jul 2024
Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
Ziyuan Huang
Kaixiang Ji
Biao Gong
Zhiwu Qing
Qinglong Zhang
Kecheng Zheng
Jian Wang
Jingdong Chen
Ming Yang
LRM
34
1
0
22 Jul 2024
Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection
Kwanyong Park
Kuniaki Saito
Donghyun Kim
VLM
CoGe
37
0
0
21 Jul 2024
Learning Visual Grounding from Generative Vision and Language Model
Shijie Wang
Dahun Kim
A. Taalimi
Chen Sun
Weicheng Kuo
ObjD
32
5
0
18 Jul 2024
SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models
Yang Zhou
Yongjian Wu
Jiya Saiyin
Bingzheng Wei
Maode Lai
Eric Chang
Yan Xu
VLM
30
0
0
16 Jul 2024
OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer
Yu Wang
Xiangbo Su
Qiang Chen
Xinyu Zhang
Teng Xi
Kun Yao
Errui Ding
Gang Zhang
Jingdong Wang
ObjD
VLM
36
0
0
15 Jul 2024
Pathformer3D: A 3D Scanpath Transformer for 360° Images
Rong Quan
Yantao Lai
Mengyu Qiu
Dong Liang
ViT
22
0
0
15 Jul 2024
Plain-Det: A Plain Multi-Dataset Object Detector
Cheng Shi
Yuchen Zhu
Sibei Yang
ObjD
VLM
24
0
0
14 Jul 2024
Layer-Wise Relevance Propagation with Conservation Property for ResNet
Seitaro Otsuki
T. Iida
Félix Doublet
Tsubasa Hirakawa
Takayoshi Yamashita
H. Fujiyoshi
Komei Sugiura
FAtt
38
4
0
12 Jul 2024
Textual Query-Driven Mask Transformer for Domain Generalized Segmentation
Byeonghyun Pak
Byeongju Woo
Sunghwan Kim
Dae-Hwan Kim
Hoseong Kim
37
3
0
12 Jul 2024
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning
Haiwen Diao
Bo Wan
Xu Jia
Yunzhi Zhuge
Ying Zhang
Huchuan Lu
Long Chen
VLM
37
4
0
10 Jul 2024
ActionVOS: Actions as Prompts for Video Object Segmentation
Liangyang Ouyang
Ruicong Liu
Yifei Huang
Ryosuke Furuta
Yoichi Sato
VOS
31
2
0
10 Jul 2024
Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI
Y. Liu
Weixing Chen
Yongjie Bai
Xiaodan Liang
Guanbin Li
Wen Gao
Liang Lin
LM&Ro
SyDa
AI4CE
48
47
0
09 Jul 2024
Multi-Object Hallucination in Vision-Language Models
Xuweiyi Chen
Ziqiao Ma
Xuejun Zhang
Sihan Xu
Shengyi Qian
Jianing Yang
David Fouhey
Joyce Chai
47
15
0
08 Jul 2024
Described Spatial-Temporal Video Detection
Wei Ji
Xiangyan Liu
Yingfei Sun
Jiajun Deng
You Qin
Ammar Nuwanna
Mengyao Qiu
Lina Wei
Roger Zimmermann
24
2
0
08 Jul 2024
FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance
Jiedong Zhuang
Jiaqi Hu
Lianrui Mu
Rui Hu
Xiaoyu Liang
Jiangnan Ye
Haoji Hu
CLIP
VLM
29
2
0
08 Jul 2024