ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
arXiv:2312.00784 · 1 December 2023
Mu Cai, Haotian Liu, Dennis Park, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Yong Jae Lee
Tags: VLM, LRM, MLLM

Papers citing "ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts" (23 of 73 papers shown)

MGIMM: Multi-Granularity Instruction Multimodal Model for Attribute-Guided Remote Sensing Image Detailed Description
Cong Yang, Zuchao Li, Lefei Zhang · 07 Jun 2024

OLIVE: Object Level In-Context Visual Embeddings
Timothy Ossowski, Junjie Hu · Tags: OCL, VLM · 02 Jun 2024

Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding
Shenghuan Sun, Gregory M. Goldgof, Alexander Schubert, Zhiqing Sun, Thomas Hartvigsen, A. Butte, Ahmed Alaa · Tags: LM&MA · 29 May 2024

Matryoshka Multimodal Models
Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee · Tags: VLM · 27 May 2024

VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models
Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Zhongyu Wei · Tags: LRM, MLLM · 27 May 2024

PerSense: Personalized Instance Segmentation in Dense Images
Muhammad Ibraheem Siddiqui, Muhammad Umer Sheikh, Hassan Abid, Muhammad Haris Khan · Tags: VLM · 22 May 2024

CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario
Zhizhao Duan, Hao Cheng, Duo Xu, Xi Wu, Xiangxie Zhang, Xi Ye, Zhen Xie · 06 May 2024

Interpreting COVID Lateral Flow Tests' Results with Foundation Models
Stuti Pandey, Josh Myers-Dean, Jarek Reynolds, Danna Gurari · 21 Apr 2024

Behavior Trees Enable Structured Programming of Language Model Agents
Richard Kelley · Tags: AI4CE, LM&Ro, LLMAG · 11 Apr 2024

Can Feedback Enhance Semantic Grounding in Large Vision-Language Models?
Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, David Acuna · Tags: VLM · 09 Apr 2024

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, Hongsheng Li · Tags: VLM · 29 Mar 2024

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan · Tags: VLM · 22 Mar 2024

Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan, Yousong Zhu, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang · Tags: ObjD · 14 Mar 2024

Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
David Wan, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal · Tags: VLM, ObjD · 04 Mar 2024

A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models
Ashutosh Sathe, Prachi Jain, Sunayana Sitaram · 21 Feb 2024

CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples
Jianrui Zhang, Mu Cai, Tengyang Xie, Yong Jae Lee · Tags: LRM · 20 Feb 2024

A Touch, Vision, and Language Dataset for Multimodal Alignment
Letian Fu, Gaurav Datta, Huang Huang, Will Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, Ken Goldberg · Tags: VLM · 20 Feb 2024

CoLLaVO: Crayon Large Language and Vision mOdel
Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro · Tags: VLM, MLLM · 17 Feb 2024

PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, ..., Karol Hausman, N. Heess, Chelsea Finn, Sergey Levine, Brian Ichter · Tags: LM&Ro, LRM · 12 Feb 2024

MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks
Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, ..., Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, Sunayana Sitaram · Tags: ELM · 13 Nov 2023

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny · Tags: MLLM · 14 Oct 2023

From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models
Dongsheng Jiang, Yuchen Liu, Songlin Liu, Jin'e Zhao, Hao Zhang, Zhen Gao, Xiaopeng Zhang, Jin Li, Hongkai Xiong · Tags: MLLM, VLM · 13 Oct 2023

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, A. Kalyan · Tags: ELM, ReLM, LRM · 20 Sep 2022