2303.04671
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
8 March 2023
Chenfei Wu
Sheng-Kai Yin
Weizhen Qi
Xiaodong Wang
Zecheng Tang
Nan Duan
MLLM
LRM
Papers citing
"Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models"
50 / 75 papers shown
Title
TUMS: Enhancing Tool-use Abilities of LLMs with Multi-structure Handlers
Aiyao He
Sijia Cui
Shuai Xu
Yanna Wang
Bo Xu
24
0
0
13 May 2025
"I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments
Z. Zhang
Zhen Sun
Z. Zhang
Zifan Peng
Yuemeng Zhao
Z. Wang
Zeren Luo
Ruiting Zuo
Xinlei He
38
0
0
07 May 2025
A Comprehensive Survey of Foundation Models in Medicine
Wasif Khan
Seowung Leem
Kyle B. See
Joshua K. Wong
Shaoting Zhang
R. Fang
AI4CE
LM&MA
VLM
95
16
0
17 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM
VLM
LRM
91
45
0
03 Jan 2025
Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web
Hiroki Furuta
Yutaka Matsuo
Aleksandra Faust
Izzeddin Gur
CLL
80
13
0
03 Jan 2025
In-Context Learning with Iterative Demonstration Selection
Chengwei Qin
Aston Zhang
C. L. P. Chen
Anirudh Dagar
Wenming Ye
LRM
60
38
0
31 Dec 2024
Do Language Models Understand Time?
Xi Ding
Lei Wang
162
0
0
18 Dec 2024
Olympus: A Universal Task Router for Computer Vision Tasks
Yuanze Lin
Yunsheng Li
Dongdong Chen
Weijian Xu
Ronald Clark
Philip H. S. Torr
VLM
ObjD
114
0
0
12 Dec 2024
CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning
Duo Wu
J. Wang
Yuan Meng
Yanning Zhang
Le Sun
Zhi Wang
102
0
0
25 Nov 2024
Spider: Any-to-Many Multimodal LLM
Jinxiang Lai
Jie Zhang
Jun Liu
Jian Li
Xiaocheng Lu
Song Guo
MLLM
52
2
0
14 Nov 2024
TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
Yiwei Guo
Shaobin Zhuang
Kunchang Li
Yu Qiao
Yali Wang
VLM
CLIP
21
0
0
16 Oct 2024
Agent-Oriented Planning in Multi-Agent Systems
Ao Li
Yuexiang Xie
Songze Li
Fugee Tsung
Bolin Ding
Yaliang Li
AIFin
58
5
0
03 Oct 2024
VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models
Jingtao Cao
Zheng Zhang
Hongru Wang
Kam-Fai Wong
22
0
0
23 Sep 2024
VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
Yunxin Li
Xinyu Chen
Baotian Hu
Longyue Wang
Haoyuan Shi
Min-Ling Zhang
MLLM
LRM
38
25
0
17 Jun 2024
VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models
Zejun Li
Ruipu Luo
Jiwen Zhang
Minghui Qiu
Zhongyu Wei
LRM
MLLM
52
6
0
27 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
64
38
0
23 May 2024
MM-Retinal: Knowledge-Enhanced Foundational Pretraining with Fundus Image-Text Expertise
Ruiqi Wu
Chenran Zhang
Jianle Zhang
Yi Zhou
Tao Zhou
Huazhu Fu
19
8
0
20 May 2024
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
Yunxin Li
Shenyuan Jiang
Baotian Hu
Longyue Wang
Wanqi Zhong
Wenhan Luo
Lin Ma
Min-Ling Zhang
MoE
30
27
0
18 May 2024
ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing
Ying Jin
Pengyang Ling
Xiao-wen Dong
Pan Zhang
Jiaqi Wang
Dahua Lin
24
2
0
18 May 2024
Realizing Visual Question Answering for Education: GPT-4V as a Multimodal AI
Gyeong-Geon Lee
Xiaoming Zhai
27
4
0
12 May 2024
From Matching to Generation: A Survey on Generative Information Retrieval
Xiaoxi Li
Jiajie Jin
Yujia Zhou
Yuyao Zhang
Peitian Zhang
Yutao Zhu
Zhicheng Dou
3DV
64
36
0
23 Apr 2024
HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision
Siddhant Bansal
Michael Wray
Dima Damen
31
3
0
15 Apr 2024
TempCompass: Do Video LLMs Really Understand Videos?
Yuanxin Liu
Shicheng Li
Yi Liu
Yuxiang Wang
Shuhuai Ren
Lei Li
Sishuo Chen
Xu Sun
Lu Hou
VLM
41
98
0
01 Mar 2024
Exploring the Potential of Large Language Models for Improving Digital Forensic Investigation Efficiency
Akila Wickramasekara
F. Breitinger
Mark Scanlon
42
7
0
29 Feb 2024
Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Model
Hao-Ran Cheng
Erjia Xiao
Jindong Gu
Le Yang
Jinhao Duan
Jize Zhang
Jiahang Cao
Kaidi Xu
Renjing Xu
24
6
0
29 Feb 2024
From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs
Yulong Liu
Yunlong Yuan
Chunwei Wang
Jianhua Han
Yongqiang Ma
Li Zhang
Nanning Zheng
Hang Xu
LLMAG
21
5
0
28 Feb 2024
AutoMMLab: Automatically Generating Deployable Models from Language Instructions for Computer Vision Tasks
Zekang Yang
Wang Zeng
Sheng Jin
Chao Qian
Ping Luo
Wentao Liu
MLLM
VLM
44
8
0
23 Feb 2024
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
Junyang Wang
Haiyang Xu
Jiabo Ye
Mingshi Yan
Weizhou Shen
Ji Zhang
Fei Huang
Jitao Sang
13
103
0
29 Jan 2024
MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
Chenyu Wang
Weixin Luo
Qianyu Chen
Haonan Mai
Jindi Guo
Sixun Dong
Xiaohua Xuan
MLLM
LLMAG
41
17
0
19 Jan 2024
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
Zongxin Yang
Guikun Chen
Xiaodi Li
Wenguan Wang
Yi Yang
LM&Ro
LLMAG
43
35
0
16 Jan 2024
LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model
Senqiao Yang
Tianyuan Qu
Xin Lai
Zhuotao Tian
Bohao Peng
Shu-Lin Liu
Jiaya Jia
VLM
21
28
0
28 Dec 2023
IQAGPT: Image Quality Assessment with Vision-language and ChatGPT Models
Zhihao Chen
Bin Hu
Chuang Niu
Tao Chen
Yuxin Li
Hongming Shan
Ge Wang
LM&MA
MLLM
19
4
0
25 Dec 2023
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLM
MLLM
38
29
0
19 Dec 2023
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Yanwei Li
Chengyao Wang
Jiaya Jia
VLM
MLLM
26
259
0
28 Nov 2023
Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding
Zhihao Yuan
Jinke Ren
Chun-Mei Feng
Hengshuang Zhao
Shuguang Cui
Zhen Li
19
26
0
26 Nov 2023
Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training
Cheng Tan
Jingxuan Wei
Zhangyang Gao
Linzhuang Sun
Siyuan Li
Ruifeng Guo
Xihong Yang
Stan Z. Li
LRM
14
7
0
23 Nov 2023
MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
Fuxiao Liu
Xiaoyang Wang
Wenlin Yao
Jianshu Chen
Kaiqiang Song
Sangwoo Cho
Yaser Yacoob
Dong Yu
15
98
0
15 Nov 2023
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Shilong Liu
Hao Cheng
Haotian Liu
Hao Zhang
Feng Li
...
Hang Su
Jun Zhu
Lei Zhang
Jianfeng Gao
Chun-yue Li
MLLM
VLM
47
102
0
09 Nov 2023
MoqaGPT : Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model
Le Zhang
Yihong Wu
Fengran Mo
Jian-Yun Nie
Aishwarya Agrawal
MLLM
RALM
21
6
0
20 Oct 2023
ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search
Yuchen Zhuang
Xiang Chen
Tong Yu
Saayan Mitra
Victor S. Bursztyn
Ryan A. Rossi
Somdeb Sarkhel
Chao Zhang
LLMAG
24
52
0
20 Oct 2023
Towards Robust Multi-Modal Reasoning via Model Selection
Xiangyan Liu
Rongxue Li
Wei Ji
Tao Lin
LLMAG
LRM
22
3
0
12 Oct 2023
Toolink: Linking Toolkit Creation and Using through Chain-of-Solving on Open-Source Model
Cheng Qian
Chenyan Xiong
Zhenghao Liu
Zhiyuan Liu
LRM
24
12
0
08 Oct 2023
Large Language Model (LLM) as a System of Multiple Expert Agents: An Approach to solve the Abstraction and Reasoning Corpus (ARC) Challenge
J. Tan
Mehul Motani
LLMAG
26
8
0
08 Oct 2023
VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use
Yonatan Bitton
Hritik Bansal
Jack Hessel
Rulin Shao
Wanrong Zhu
Anas Awadalla
Josh Gardner
Rohan Taori
L. Schmidt
VLM
29
76
0
12 Aug 2023
Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models
Cheng-Yu Hsieh
Sibei Chen
Chun-Liang Li
Yasuhisa Fujii
Alexander Ratner
Chen-Yu Lee
Ranjay Krishna
Tomas Pfister
LLMAG
SyDa
27
40
0
01 Aug 2023
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
Yujia Qin
Shi Liang
Yining Ye
Kunlun Zhu
Lan Yan
...
Jie Zhou
Mark B. Gerstein
Dahai Li
Zhiyuan Liu
Maosong Sun
CLL
ALM
LLMAG
ELM
LM&MA
36
608
0
31 Jul 2023
Testing the Depth of ChatGPT's Comprehension via Cross-Modal Tasks Based on ASCII-Art: GPT3.5's Abilities in Regard to Recognizing and Generating ASCII-Art Are Not Totally Lacking
David Bayani
MLLM
24
5
0
28 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming Yang
F. Khan
VLM
13
116
0
25 Jul 2023
Fashion Matrix: Editing Photos by Just Talking
Zheng Chong
Xujie Zhang
Fuwei Zhao
Zhenyu Xie
Xiaodan Liang
DiffM
6
2
0
25 Jul 2023
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
Yang Zhao
Zhijie Lin
Daquan Zhou
Zilong Huang
Jiashi Feng
Bingyi Kang
MLLM
26
106
0
17 Jul 2023