Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2303.08128
Cited By
ViperGPT: Visual Inference via Python Execution for Reasoning
14 March 2023
Dídac Surís
Sachit Menon
Carl Vondrick
MLLM
LRM
ReLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"ViperGPT: Visual Inference via Python Execution for Reasoning"
50 / 339 papers shown
Title
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Jiasen Lu
Christopher Clark
Sangho Lee
Zichen Zhang
Savya Khosla
Ryan Marten
Derek Hoiem
Aniruddha Kembhavi
VLM
MLLM
27
143
0
28 Dec 2023
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
156
895
0
21 Dec 2023
ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation
Difei Gao
Lei Ji
Zechen Bai
Mingyu Ouyang
Peiran Li
...
Peiyi Wang
Xiangwu Guo
Hengxu Wang
Luowei Zhou
Mike Zheng Shou
LLMAG
12
21
0
20 Dec 2023
Designing LLM Chains by Adapting Techniques from Crowdsourcing Workflows
Madeleine Grunde-McLaughlin
Michelle S. Lam
Ranjay Krishna
Daniel S. Weld
Jeffrey Heer
AI4CE
45
20
0
18 Dec 2023
CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update
Zhi Gao
Yuntao Du
Xintong Zhang
Xiaojian Ma
Wenjuan Han
Song-Chun Zhu
Qing Li
LLMAG
VLM
26
21
0
18 Dec 2023
Iterative Motion Editing with Natural Language
Purvi Goel
Kuan-Chieh Wang
C. Karen Liu
Kayvon Fatahalian
DiffM
22
22
0
15 Dec 2023
InstructPipe: Generating Visual Blocks Pipelines with Human Instructions and LLMs
Zhongyi Zhou
Jing Jin
Vrushank Phadnis
Xiuxiu Yuan
Jun Jiang
...
A. Olwal
David Kim
Ram Iyengar
Na Li
Ruofei Du
25
5
0
15 Dec 2023
DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving
Wenhai Wang
Jiangwei Xie
ChuanYang Hu
Haoming Zou
Jianan Fan
...
Lewei Lu
Xizhou Zhu
Xiaogang Wang
Yu Qiao
Jifeng Dai
34
122
0
14 Dec 2023
Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment
Utkarsh Mall
Cheng Perng Phoo
Meilin Kelsey Liu
Carl Vondrick
B. Hariharan
Kavita Bala
VLM
12
36
0
12 Dec 2023
Vista-LLaMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens
Fan Ma
Xiaojie Jin
Heng Wang
Yuchen Xian
Jiashi Feng
Yi Yang
13
47
0
12 Dec 2023
AVA: Towards Autonomous Visualization Agents through Visual Perception-Driven Decision-Making
Shusen Liu
Haichao Miao
Zhimin Li
M. Olson
Valerio Pascucci
P. Bremer
22
8
0
07 Dec 2023
Chain of Code: Reasoning with a Language Model-Augmented Code Emulator
Chengshu Li
Jacky Liang
Andy Zeng
Xinyun Chen
Karol Hausman
Dorsa Sadigh
Sergey Levine
Fei-Fei Li
Fei Xia
Brian Ichter
LLMAG
LRM
31
70
0
07 Dec 2023
Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use
Yuhan Chen
Ang Lv
Ting-En Lin
C. Chen
Yuchuan Wu
Fei Huang
Yongbin Li
Rui Yan
21
24
0
07 Dec 2023
LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs
Yunsheng Ma
Can Cui
Xu Cao
Wenqian Ye
Peiran Liu
...
Rohit Gupta
Kyungtae Han
Aniket Bera
James M. Rehg
Ziran Wang
21
42
0
07 Dec 2023
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
Yushi Hu
Otilia Stretcu
Chun-Ta Lu
Krishnamurthy Viswanathan
Kenji Hata
Enming Luo
Ranjay Krishna
Ariel Fuxman
VLM
LRM
MLLM
32
26
0
05 Dec 2023
Recursive Visual Programming
Jiaxin Ge
Sanjay Subramanian
Baifeng Shi
Roei Herzig
Trevor Darrell
27
4
0
04 Dec 2023
Video Summarization: Towards Entity-Aware Captions
Hammad A. Ayyubi
Tianqi Liu
Arsha Nagrani
Xudong Lin
Mingda Zhang
Anurag Arnab
Feng Han
Yukun Zhu
Jialu Liu
Shih-Fu Chang
26
1
0
01 Dec 2023
Zero-Shot Video Question Answering with Procedural Programs
Rohan Choudhury
Koichiro Niinuma
Kris M. Kitani
László A. Jeni
19
21
0
01 Dec 2023
VIoTGPT: Learning to Schedule Vision Tools towards Intelligent Video Internet of Things
Yaoyao Zhong
Mengshi Qi
Rui Wang
Yuhan Qiu
Yang Zhang
Huadong Ma
13
2
0
01 Dec 2023
Leveraging VLM-Based Pipelines to Annotate 3D Objects
Rishabh Kabra
Loic Matthey
Alexander Lerchner
Niloy J. Mitra
8
6
0
29 Nov 2023
Compositional Chain-of-Thought Prompting for Large Multimodal Models
Chancharik Mitra
Brandon Huang
Trevor Darrell
Roei Herzig
MLLM
LRM
26
80
0
27 Nov 2023
Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding
Zhihao Yuan
Jinke Ren
Chun-Mei Feng
Hengshuang Zhao
Shuguang Cui
Zhen Li
19
26
0
26 Nov 2023
Vamos: Versatile Action Models for Video Understanding
Shijie Wang
Qi Zhao
Minh Quan Do
Nakul Agarwal
Kwonjoon Lee
Chen Sun
27
19
0
22 Nov 2023
ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation
Yangyi Chen
Xingyao Wang
Manling Li
Derek Hoiem
Heng Ji
25
10
0
22 Nov 2023
GAIA: a benchmark for General AI Assistants
Grégoire Mialon
Clémentine Fourrier
Craig Swift
Thomas Wolf
Yann LeCun
Thomas Scialom
AI4MH
ALM
ELM
RALM
15
137
0
21 Nov 2023
De-fine: Decomposing and Refining Visual Programs with Auto-Feedback
Minghe Gao
Juncheng Li
Hao Fei
Liang Pang
Wei Ji
Guoming Wang
Wenqiao Zhang
Siliang Tang
Yueting Zhuang
16
8
0
21 Nov 2023
Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
Zhuosheng Zhang
Yao Yao
Aston Zhang
Xiangru Tang
Xinbei Ma
...
Yiming Wang
Mark B. Gerstein
Rui Wang
Gongshen Liu
Hai Zhao
LLMAG
LM&Ro
LRM
31
51
0
20 Nov 2023
SelfEval: Leveraging the discriminative nature of generative models for evaluation
Sai Saketh Rambhatla
Ishan Misra
EGVM
25
4
0
17 Nov 2023
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin
Yang Ye
Bin Zhu
Jiaxi Cui
Munan Ning
Peng Jin
Li-ming Yuan
VLM
MLLM
194
576
0
16 Nov 2023
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Peng Jin
Ryuichi Takanobu
Caiwan Zhang
Xiaochun Cao
Li-ming Yuan
MLLM
32
217
0
14 Nov 2023
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
An Yan
Zhengyuan Yang
Wanrong Zhu
K. Lin
Linjie Li
...
Yiwu Zhong
Julian McAuley
Jianfeng Gao
Zicheng Liu
Lijuan Wang
LLMAG
LM&Ro
14
100
0
13 Nov 2023
Past as a Guide: Leveraging Retrospective Learning for Python Code Completion
Seunggyoon Shin
Seunggyu Chang
Sungjoon Choi
KELM
32
1
0
13 Nov 2023
PerceptionGPT: Effectively Fusing Visual Perception into LLM
Renjie Pi
Lewei Yao
Jiahui Gao
Jipeng Zhang
Tong Zhang
MLLM
18
30
0
11 Nov 2023
Analyzing Modular Approaches for Visual Question Decomposition
Apoorv Khandelwal
Ellie Pavlick
Chen Sun
35
4
0
10 Nov 2023
How to Bridge the Gap between Modalities: Survey on Multimodal Large Language Model
Shezheng Song
Xiaopeng Li
Shasha Li
Shan Zhao
Jie Yu
Jun Ma
Xiaoguang Mao
Weimin Zhang
66
3
0
10 Nov 2023
Follow-Up Differential Descriptions: Language Models Resolve Ambiguities for Image Classification
Reza Esfandiarpoor
Stephen H. Bach
VLM
19
13
0
10 Nov 2023
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Shilong Liu
Hao Cheng
Haotian Liu
Hao Zhang
Feng Li
...
Hang Su
Jun Zhu
Lei Zhang
Jianfeng Gao
Chun-yue Li
MLLM
VLM
47
102
0
09 Nov 2023
GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs
Zhenfang Chen
Rui Sun
Wenjun Liu
Yining Hong
Chuang Gan
LRM
21
14
0
08 Nov 2023
RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation
Yufei Wang
Zhou Xian
Feng Chen
Tsun-Hsuan Wang
Yian Wang
Katerina Fragkiadaki
Zackory M. Erickson
David Held
Chuang Gan
LM&Ro
30
94
0
02 Nov 2023
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Wei-Ge Chen
Irina Spiridonova
Jianwei Yang
Jianfeng Gao
Chun-yue Li
MLLM
VLM
11
33
0
01 Nov 2023
Symbolic Planning and Code Generation for Grounded Dialogue
Justin T. Chiu
Wenting Zhao
Derek Chen
Saujas Vaduguru
Alexander M. Rush
Daniel Fried
LLMAG
8
7
0
26 Oct 2023
WebWISE: Web Interface Control and Sequential Exploration with Large Language Models
Heyi Tao
TV Sethuraman
Michal Shlapentokh-Rothman
Derek Hoiem
LLMAG
48
4
0
24 Oct 2023
What's Left? Concept Grounding with Logic-Enhanced Foundation Models
Joy Hsu
Jiayuan Mao
Joshua B. Tenenbaum
Jiajun Wu
VLM
ReLM
LRM
18
21
0
24 Oct 2023
Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs
Jiarui Zhang
Mahyar Khayatkhoei
P. Chhikara
Filip Ilievski
13
2
0
24 Oct 2023
Large Language Models are Visual Reasoning Coordinators
Liangyu Chen
Bo Li
Sheng Shen
Jingkang Yang
Chunyuan Li
Kurt Keutzer
Trevor Darrell
Ziwei Liu
VLM
LRM
34
46
0
23 Oct 2023
Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models
Gabriel H. Sarch
Yue Wu
Michael J. Tarr
Katerina Fragkiadaki
LM&Ro
LLMAG
11
18
0
23 Oct 2023
API-Assisted Code Generation for Question Answering on Varied Table Structures
Yihan Cao
Shuyi Chen
Ryan Liu
Zhiruo Wang
Daniel Fried
LMTD
14
10
0
23 Oct 2023
MoqaGPT : Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model
Le Zhang
Yihong Wu
Fengran Mo
Jian-Yun Nie
Aishwarya Agrawal
MLLM
RALM
27
6
0
20 Oct 2023
Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds
Sipeng Zheng
Jiazheng Liu
Yicheng Feng
Zongqing Lu
29
29
0
20 Oct 2023
Frozen Transformers in Language Models Are Effective Visual Encoder Layers
Ziqi Pang
Ziyang Xie
Yunze Man
Yu-xiong Wang
38
25
0
19 Oct 2023
Previous
1
2
3
4
5
6
7
Next