Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2311.05437
Cited By
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
9 November 2023
Shilong Liu
Hao Cheng
Haotian Liu
Hao Zhang
Feng Li
Tianhe Ren
Xueyan Zou
Jianwei Yang
Hang Su
Jun Zhu
Lei Zhang
Jianfeng Gao
Chun-yue Li
MLLM
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents"
50 / 83 papers shown
Title
NGENT: Next-Generation AI Agents Must Integrate Multi-Domain Abilities to Achieve Artificial General Intelligence
Zhicong Li
Hangyu Mao
Jiangjin Yin
Mingzhe Xing
Zhiwei Xu
Yuanxing Zhang
Yang Xiao
29
0
0
30 Apr 2025
Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes
Joan Perez
Giovanni Fusco
20
0
0
23 Apr 2025
Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens
Kaihang Pan
Wang Lin
Zhongqi Yue
Tenglong Ao
Liyu Jia
Wei Zhao
Juncheng Billy Li
Siliang Tang
Hanwang Zhang
35
1
0
20 Apr 2025
Multimodal Agricultural Agent Architecture (MA3): A New Paradigm for Intelligent Agricultural Decision-Making
Zhuoning Xu
Jian Xu
M. Zhang
P. Wang
Chao Deng
Cheng-Lin Liu
26
0
0
07 Apr 2025
TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection
C. Xie
Tongxuan Liu
Lei Jiang
Yuting Zeng
J. Guo
Yunheng Shen
Weizhe Huang
Jing Li
Xiaohua Xu
VLM
53
0
0
05 Apr 2025
Stochastic Optimization with Optimal Importance Sampling
Liviu Aolaritei
Bart P. G. Van Parys
H. Lam
Michael I. Jordan
33
0
0
04 Apr 2025
QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA
Shuai Li
Jian Xu
Xiao-Hui Li
Chao Deng
Lin-Lin Huang
MQ
41
0
0
01 Apr 2025
Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding
Weiyu Guo
Ziyang Chen
Shaoguang Wang
JianXiang He
Yijie Xu
Jinhui Ye
Ying Sun
Hui Xiong
42
1
0
17 Mar 2025
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game
Z. Wang
Yurui Dong
Fuwen Luo
Minyuan Ruan
Zhili Cheng
C. L. P. Chen
Peng Li
Yang Liu
LRM
79
0
0
13 Mar 2025
REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding
Yan Tai
Luhao Zhu
Zhiqiang Chen
Ynan Ding
Yiying Dong
Xiaohong Liu
Guodong Guo
MLLM
ObjD
47
0
0
10 Mar 2025
OWLViz: An Open-World Benchmark for Visual Question Answering
T. Nguyen
Dang Nguyen
Hoang Nguyen
Thuan Luong
Long Hoang Dang
Viet Dac Lai
VLM
61
0
0
04 Mar 2025
RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
Yuheng Ji
Huajie Tan
Jiayu Shi
Xiaoshuai Hao
Yuan Zhang
...
Huaihai Lyu
Xiaolong Zheng
Jiaming Liu
Zhongyuan Wang
Shanghang Zhang
80
5
0
28 Feb 2025
MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
Jiarui Zhang
Mahyar Khayatkhoei
P. Chhikara
Filip Ilievski
LRM
39
5
0
24 Feb 2025
Image Embedding Sampling Method for Diverse Captioning
Sania Waheed
Na Min An
52
0
0
14 Feb 2025
PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging
Gang Liu
Jinlong He
Pengfei Li
Genrong He
Zixu Zhao
Shenjun Zhong
LM&MA
65
2
0
17 Jan 2025
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Haobo Yuan
X. Li
Tao Zhang
Zilong Huang
Shilin Xu
S. Ji
Yunhai Tong
Lu Qi
Jiashi Feng
Ming Yang
VLM
82
11
0
07 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM
VLM
LRM
88
45
0
03 Jan 2025
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Hao Fei
Shengqiong Wu
H. Zhang
Tat-Seng Chua
Shuicheng Yan
56
35
0
31 Dec 2024
CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning
Duo Wu
J. Wang
Yuan Meng
Yanning Zhang
Le Sun
Zhi Wang
96
0
0
25 Nov 2024
OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning
Xiaoqiang Wang
Bang Liu
LLMAG
LM&Ro
LRM
31
6
0
24 Oct 2024
Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents
Sabit Hassan
Hye-Young Chung
Xiang Zhi Tan
Malihe Alikhani
34
0
0
18 Oct 2024
A Survey on Data Synthesis and Augmentation for Large Language Models
Ke Wang
Jiahui Zhu
Minjie Ren
Z. Liu
Shiwei Li
...
Chenkai Zhang
Xiaoyu Wu
Qiqi Zhan
Qingjie Liu
Yunhong Wang
SyDa
36
15
0
16 Oct 2024
From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning
Yang Bai
Yang Zhou
Jun Zhou
Rick Siow Mong Goh
Daniel Ting
Yong Liu
VLM
44
0
0
09 Oct 2024
An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation
Ahmed Abdulaal
Hugo Fry
Nina Montaña-Brown
Ayodeji Ijishakin
Jack Gao
Stephanie L. Hyland
Daniel C. Alexander
Daniel Coelho De Castro
MedIm
31
7
0
04 Oct 2024
Visual Editing with LLM-based Tool Chaining: An Efficient Distillation Approach for Real-Time Applications
Oren Sultan
Alex Khasin
Guy Shiran
Asnat Greenstein-Messica
Dafna Shahaf
16
0
0
03 Oct 2024
SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information
Jiashuo Sun
Jihai Zhang
Yucheng Zhou
Zhaochen Su
Xiaoye Qu
Yu Cheng
37
10
0
21 Sep 2024
Mitigating Hallucination in Visual-Language Models via Re-Balancing Contrastive Decoding
Xiaoyu Liang
Jiayuan Yu
Lianrui Mu
Jiedong Zhuang
Jiaqi Hu
Yuchen Yang
Jiangnan Ye
Lu Lu
Jian Chen
Haoji Hu
VLM
32
0
0
10 Sep 2024
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Chunting Zhou
Lili Yu
Arun Babu
Kushal Tirumala
Michihiro Yasunaga
Leonid Shamis
Jacob Kahn
Xuezhe Ma
Luke Zettlemoyer
Omer Levy
DiffM
23
145
0
20 Aug 2024
UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model
Zhaowei Li
Wei Wang
Yiqing Cai
Xu Qi
Pengyu Wang
Dong Zhang
Hang Song
Botian Jiang
Zhida Huang
Tao Wang
AIFin
LRM
32
3
0
05 Aug 2024
AppAgent v2: Advanced Agent for Flexible Mobile Interactions
Yanda Li
Chi Zhang
Wanqi Yang
Bin-Bin Fu
Pei Cheng
Xin Chen
Ling Chen
Yunchao Wei
LLMAG
LM&Ro
23
9
0
05 Aug 2024
ViLLa: Video Reasoning Segmentation with Large Language Model
Rongkun Zheng
Lu Qi
Xi Chen
Yi Wang
Kun Wang
Yu Qiao
Hengshuang Zhao
VOS
LRM
45
2
0
18 Jul 2024
GeNet: A Multimodal LLM-Based Co-Pilot for Network Topology and Configuration
Beni Ifland
Elad Duani
Rubin Krief
Miro Ohana
Aviram Zilberman
...
Ortal Lavi
Hikichi Kenji
A. Shabtai
Yuval Elovici
Rami Puzis
15
3
0
11 Jul 2024
MMedAgent: Learning to Use Medical Tools with Multi-modal Agent
Binxu Li
Tiankai Yan
Yuanting Pan
Zhe Xu
Jie Luo
Ruiyang Ji
Shilong Liu
Haoyu Dong
Zihao Lin
Yixin Wang
LM&MA
31
24
0
02 Jul 2024
CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
Tianqi Xu
Linyao Chen
Dai-Jie Wu
Yanjun Chen
Zecheng Zhang
...
Shilong Liu
Bochen Qian
Philip H. S. Torr
Bernard Ghanem
G. Li
35
14
0
01 Jul 2024
Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models
Weihong Zhong
Xiaocheng Feng
Liang Zhao
Qiming Li
Lei Huang
Yuxuan Gu
Weitao Ma
Yuan Xu
Bing Qin
MLLM
36
9
0
30 Jun 2024
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Tao Zhang
Xiangtai Li
Hao Fei
Haobo Yuan
Shengqiong Wu
Shunping Ji
Chen Change Loy
Shuicheng Yan
LRM
MLLM
VLM
47
44
0
27 Jun 2024
GUICourse: From General Vision Language Models to Versatile GUI Agents
Wentong Chen
Junbo Cui
Jinyi Hu
Yujia Qin
Junjie Fang
...
Yupeng Huo
Yuan Yao
Yankai Lin
Zhiyuan Liu
Maosong Sun
LLMAG
26
31
0
17 Jun 2024
From Pixels to Prose: A Large Dataset of Dense Image Captions
Vasu Singla
Kaiyu Yue
Sukriti Paul
Reza Shirkavand
Mayuka Jayawardhana
Alireza Ganjdanesh
Heng Huang
A. Bhatele
Gowthami Somepalli
Tom Goldstein
3DV
VLM
20
22
0
14 Jun 2024
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Yushi Hu
Weijia Shi
Xingyu Fu
Dan Roth
Mari Ostendorf
Luke Zettlemoyer
Noah A. Smith
Ranjay Krishna
LRM
32
34
0
13 Jun 2024
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs
Zijia Zhao
Haoyu Lu
Yuqi Huo
Yifan Du
Tongtian Yue
Longteng Guo
Bingning Wang
Weipeng Chen
Jing Liu
28
2
0
13 Jun 2024
Wings: Learning Multimodal LLMs without Text-only Forgetting
Yi-Kai Zhang
Shiyin Lu
Yang Li
Yanqing Ma
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
De-Chuan Zhan
Han-Jia Ye
VLM
33
6
0
05 Jun 2024
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Ling-Hao Chen
Shunlin Lu
Ailing Zeng
Hao Zhang
Benyou Wang
Ruimao Zhang
Lei Zhang
45
33
0
30 May 2024
Typography Leads Semantic Diversifying: Amplifying Adversarial Transferability across Multimodal Large Language Models
Hao-Ran Cheng
Erjia Xiao
Jiahang Cao
Le Yang
Kaidi Xu
Jindong Gu
Renjing Xu
AAML
50
7
0
30 May 2024
A Human-Like Reasoning Framework for Multi-Phases Planning Task with Large Language Models
Chengxing Xie
Difan Zou
LRM
LLMAG
27
4
0
28 May 2024
A Misleading Gallery of Fluid Motion by Generative Artificial Intelligence
Ali Kashefi
VGen
40
5
0
24 May 2024
IM-RAG: Multi-Round Retrieval-Augmented Generation Through Learning Inner Monologues
Diji Yang
Jinmeng Rao
Kezhen Chen
Xiaoyuan Guo
Yawen Zhang
Jie Yang
Yi Zhang
LRM
RALM
37
4
0
15 May 2024
VS-Assistant: Versatile Surgery Assistant on the Demand of Surgeons
Zhen Chen
Xingjian Luo
Jinlin Wu
Danny Tat Ming Chan
Zhen Lei
Jinqiao Wang
Sebastien Ourselin
Hongbin Liu
21
4
0
14 May 2024
DoLLM: How Large Language Models Understanding Network Flow Data to Detect Carpet Bombing DDoS
Qingyang Li
Yihang Zhang
Zhidong Jia
Yannan Hu
Lei Zhang
Jianrong Zhang
Yongming Xu
Yong Cui
Zongming Guo
Xinggong Zhang
AI4CE
29
6
0
13 May 2024
ChatHuman: Language-driven 3D Human Understanding with Retrieval-Augmented Tool Reasoning
Jing Lin
Yao Feng
Weiyang Liu
Michael J. Black
3DH
LRM
34
5
0
07 May 2024
BattleAgent: Multi-modal Dynamic Emulation on Historical Battles to Complement Historical Analysis
Shuhang Lin
Wenyue Hua
Lingyao Li
Che-Jui Chang
Lizhou Fan
Jianchao Ji
Hang Hua
Mingyu Jin
Jiebo Luo
Yongfeng Zhang
LM&Ro
LLMAG
46
8
0
23 Apr 2024
1
2
Next