Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2406.19389
Cited By
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
27 June 2024
Tao Zhang
Xiangtai Li
Hao Fei
Haobo Yuan
Shengqiong Wu
Shunping Ji
Chen Change Loy
Shuicheng Yan
LRM
MLLM
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding"
31 / 31 papers shown
Title
LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation
Jiachen Li
Qing Xie
Xiaohan Yu
Hongyun Wang
Jinyu Xu
Yongjian Liu
ObjD
69
0
0
20 Apr 2025
HistLLM: A Unified Framework for LLM-Based Multimodal Recommendation with User History Encoding and Compression
Chen Zhang
Bo Hu
Weidong Chen
Zhendong Mao
29
0
0
14 Apr 2025
Vector-Quantized Vision Foundation Models for Object-Centric Learning
Rongzhen Zhao
V. Wang
Juho Kannala
J. Pajarinen
OCL
VLM
91
0
0
27 Feb 2025
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Miran Heo
Min-Hung Chen
De-An Huang
Sifei Liu
Subhashree Radhakrishnan
Seon Joo Kim
Yu-Chun Wang
Ryo Hachiuma
ObjD
VLM
100
2
0
14 Jan 2025
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Haobo Yuan
X. Li
Tao Zhang
Zilong Huang
Shilin Xu
S. Ji
Yunhai Tong
Lu Qi
Jiashi Feng
Ming Yang
VLM
65
11
0
07 Jan 2025
Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech
Rui Liu
Shuwei He
Yifan Hu
H. Li
VLM
79
1
0
16 Dec 2024
ViLLa: Video Reasoning Segmentation with Large Language Model
Rongkun Zheng
Lu Qi
Xi Chen
Yi Wang
Kun Wang
Yu Qiao
Hengshuang Zhao
VOS
LRM
40
2
0
18 Jul 2024
F-LMM: Grounding Frozen Large Multimodal Models
Size Wu
Sheng Jin
Wenwei Zhang
Lumin Xu
Wentao Liu
Wei Li
Chen Change Loy
MLLM
56
12
0
09 Jun 2024
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Shengqiong Wu
Hao Fei
Xiangtai Li
Jiayi Ji
Hanwang Zhang
Tat-Seng Chua
Shuicheng Yan
MLLM
52
25
0
07 Jun 2024
LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning
Junchi Wang
Lei Ke
MLLM
LRM
VLM
28
3
0
12 Apr 2024
LaSagnA: Language-based Segmentation Assistant for Complex Queries
Cong Wei
Haoxian Tan
Yujie Zhong
Yujiu Yang
Lin Ma
32
14
0
12 Apr 2024
DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries
Yikang Zhou
Tao Zhang
Shunping Ji
Shuicheng Yan
Xiangtai Li
23
5
0
29 Mar 2024
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Weifeng Lin
Xinyu Wei
Ruichuan An
Peng Gao
Bocheng Zou
Yulin Luo
Siyuan Huang
Shanghang Zhang
Hongsheng Li
VLM
40
31
0
29 Mar 2024
Generalizable Entity Grounding via Assistance of Large Language Model
Lu Qi
Yi-Wen Chen
Lehan Yang
Tiancheng Shen
Xiangtai Li
Weidong Guo
Yu-Syuan Xu
Ming-Hsuan Yang
VLM
39
9
0
04 Feb 2024
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
Xiao-wen Dong
Pan Zhang
Yuhang Zang
Yuhang Cao
Bin Wang
...
Conghui He
Xingcheng Zhang
Yu Qiao
Dahua Lin
Jiaqi Wang
VLM
MLLM
70
89
0
29 Jan 2024
OMG-Seg: Is One Model Good Enough For All Segmentation?
Xiangtai Li
Haobo Yuan
Wei Li
Henghui Ding
Size Wu
Wenwei Zhang
Yining Li
Kai Chen
Chen Change Loy
VLM
MLLM
ViT
61
48
0
18 Jan 2024
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
Hao Zhang
Hongyang Li
Feng Li
Tianhe Ren
Xueyan Zou
...
Shijia Huang
Jianfeng Gao
Lei Zhang
Chun-yue Li
Jianwei Yang
85
30
0
05 Dec 2023
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin
Yang Ye
Bin Zhu
Jiaxi Cui
Munan Ning
Peng Jin
Li-ming Yuan
VLM
MLLM
182
576
0
16 Nov 2023
In-Context Learning Unlocked for Diffusion Models
Zhendong Wang
Yifan Jiang
Yadong Lu
Yelong Shen
Pengcheng He
Weizhu Chen
Zhangyang Wang
Mingyuan Zhou
VLM
DiffM
78
68
0
01 May 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
244
4,186
0
30 Jan 2023
Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus
Xiang Li
Jinglu Wang
Xiaohao Xu
Xiao Li
Bhiksha Raj
Yan Lu
VOS
34
20
0
04 Jul 2022
TubeFormer-DeepLab: Video Mask Transformer
Dahun Kim
Jun Xie
Huiyu Wang
Siyuan Qiao
Qihang Yu
Hong-Seok Kim
Hartwig Adam
In So Kweon
Liang-Chieh Chen
ViT
MedIm
64
40
0
30 May 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
S. Hoi
MLLM
BDL
VLM
CLIP
380
4,010
0
28 Jan 2022
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
Zhao Yang
Jiaqi Wang
Yansong Tang
Kai-xiang Chen
Hengshuang Zhao
Philip H. S. Torr
115
308
0
04 Dec 2021
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
Sachin Mehta
Mohammad Rastegari
ViT
178
1,148
0
05 Oct 2021
Pix2seq: A Language Modeling Framework for Object Detection
Ting-Li Chen
Saurabh Saxena
Lala Li
David J. Fleet
Geoffrey E. Hinton
MLLM
ViT
VLM
223
341
0
22 Sep 2021
Learning to Prompt for Vision-Language Models
Kaiyang Zhou
Jingkang Yang
Chen Change Loy
Ziwei Liu
VPVLM
CLIP
VLM
319
2,108
0
02 Sep 2021
Context-Aware Mixup for Domain Adaptive Semantic Segmentation
Qianyu Zhou
Zhengyang Feng
Qiqi Gu
Jiangmiao Pang
Guangliang Cheng
Xuequan Lu
Jianping Shi
Lizhuang Ma
46
97
0
08 Aug 2021
Collaborative Video Object Segmentation by Multi-Scale Foreground-Background Integration
Zongxin Yang
Yunchao Wei
Yi Yang
VOS
31
160
0
13 Oct 2020
Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation
Gen Luo
Yiyi Zhou
Xiaoshuai Sun
Liujuan Cao
Chenglin Wu
Cheng Deng
Rongrong Ji
ObjD
141
282
0
19 Mar 2020
Semantic Understanding of Scenes through the ADE20K Dataset
Bolei Zhou
Hang Zhao
Xavier Puig
Tete Xiao
Sanja Fidler
Adela Barriuso
Antonio Torralba
SSeg
243
1,817
0
18 Aug 2016
1