2305.18565
Cited By
PaLI-X: On Scaling up a Multilingual Vision and Language Model
29 May 2023
Xi Chen
Josip Djolonga
Piotr Padlewski
Basil Mustafa
Soravit Changpinyo
Jialin Wu
Carlos Riquelme Ruiz
Sebastian Goodman
Xiao Wang
Yi Tay
Siamak Shakeri
Mostafa Dehghani
Daniel M. Salz
Mario Lucic
Michael Tschannen
Arsha Nagrani
Hexiang Hu
Mandar Joshi
Bo Pang
Ceslee Montgomery
Paulina Pietrzyk
Marvin Ritter
A. Piergiovanni
Matthias Minderer
Filip Pavetić
Austin Waters
Gang Li
Ibrahim M. Alabdulmohsin
Lucas Beyer
J. Amelot
Kenton Lee
Andreas Steiner
Yang Li
Daniel Keysers
Anurag Arnab
Yuanzhong Xu
Keran Rong
Alexander Kolesnikov
Mojtaba Seyedhosseini
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut
VLM
Papers citing "PaLI-X: On Scaling up a Multilingual Vision and Language Model"
50 / 161 papers shown
Title
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping-Chia Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
47
31
0
26 May 2024
Semantica: An Adaptable Image-Conditioned Diffusion Model
Manoj Kumar
N. Houlsby
Emiel Hoogeboom
DiffM
VLM
29
0
0
23 May 2024
DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis
Yao Teng
Yue Wu
Han Shi
Xuefei Ning
Guohao Dai
Yu-Xiang Wang
Zhenguo Li
Xihui Liu
Mamba
40
32
0
23 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
60
38
0
23 May 2024
Libra: Building Decoupled Vision System on Large Language Models
Yifan Xu
Xiaoshan Yang
Y. Song
Changsheng Xu
MLLM
VLM
25
0
0
16 May 2024
What Foundation Models can Bring for Robot Learning in Manipulation: A Survey
Dingzhe Li
Yixiang Jin
A. Yong
Hongze Yu
Jun Shi
Xiaoshuai Hao
Peng Hao
Huaping Liu
Fuchun Sun
Bin Fang
AI4CE
LM&Ro
62
12
0
28 Apr 2024
AutoAD III: The Prequel -- Back to the Pixels
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
DiffM
36
20
0
22 Apr 2024
MoVA: Adapting Mixture of Vision Experts to Multimodal Context
Zhuofan Zong
Bingqi Ma
Dazhong Shen
Guanglu Song
Hao Shao
Dongzhi Jiang
Hongsheng Li
Yu Liu
MoE
37
40
0
19 Apr 2024
BLINK: Multimodal Large Language Models Can See but Not Perceive
Xingyu Fu
Yushi Hu
Bangzheng Li
Yu Feng
Haoyu Wang
Xudong Lin
Dan Roth
Noah A. Smith
Wei-Chiu Ma
Ranjay Krishna
VLM
LRM
MLLM
41
107
0
18 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models
Oğuzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM
VLM
42
25
0
10 Apr 2024
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
Juhong Min
Shyamal Buch
Arsha Nagrani
Minsu Cho
Cordelia Schmid
LRM
34
20
0
09 Apr 2024
mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning
Jingxuan Wei
Nan Xu
Guiyong Chang
Yin Luo
Bihui Yu
Ruifeng Guo
39
2
0
02 Apr 2024
Streaming Dense Video Captioning
Xingyi Zhou
Anurag Arnab
Shyamal Buch
Shen Yan
Austin Myers
Xuehan Xiong
Arsha Nagrani
Cordelia Schmid
VLM
23
30
0
01 Apr 2024
IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations
Deqing Fu
Ghazal Khalighinejad
Ollie Liu
Bhuwan Dhingra
Dani Yogatama
Robin Jia
W. Neiswanger
23
14
0
01 Apr 2024
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
Kai Zhang
Yi Luan
Hexiang Hu
Kenton Lee
Siyuan Qiao
Wenhu Chen
Yu-Chuan Su
Ming-Wei Chang
VLM
LRM
25
32
0
28 Mar 2024
LocCa: Visual Pretraining with Location-aware Captioners
Bo Wan
Michael Tschannen
Yongqin Xian
Filip Pavetić
Ibrahim M. Alabdulmohsin
Xiao Wang
André Susano Pinto
Andreas Steiner
Lucas Beyer
Xiaohua Zhai
VLM
35
5
0
28 Mar 2024
OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition
Jianqiang Wan
Sibo Song
Wenwen Yu
Yuliang Liu
Wenqing Cheng
Fei Huang
Xiang Bai
Cong Yao
Zhibo Yang
34
26
0
28 Mar 2024
Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA
Zhuowan Li
Bhavan A. Jasani
Peng Tang
Shabnam Ghadar
LRM
14
8
0
25 Mar 2024
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs
Théophane Vallaeys
Mustafa Shukor
Matthieu Cord
Jakob Verbeek
52
12
0
20 Mar 2024
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
Victor Carbune
Hassan Mansoor
Fangyu Liu
Rahul Aralikatte
Gilles Baechler
Jindong Chen
Abhanshu Sharma
ReLM
LRM
122
11
0
19 Mar 2024
MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control
Enshen Zhou
Yiran Qin
Zhen-fei Yin
Yuzhou Huang
Ruimao Zhang
Lu Sheng
Yu Qiao
Jing Shao
LM&Ro
AI4CE
35
32
0
18 Mar 2024
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Brandon McKinzie
Zhe Gan
J. Fauconnier
Sam Dodge
Bowen Zhang
...
Zirui Wang
Ruoming Pang
Peter Grasch
Alexander Toshev
Yinfei Yang
MLLM
27
185
0
14 Mar 2024
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan
Yousong Zhu
Hongyin Zhao
Fan Yang
Ming Tang
Jinqiao Wang
ObjD
31
12
0
14 Mar 2024
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
Imad Eddine Toubal
Aditya Avinash
N. Alldrin
Jan Dlabal
Wenlei Zhou
...
Chun-Ta Lu
Howard Zhou
Ranjay Krishna
Ariel Fuxman
Tom Duerig
VLM
67
7
0
05 Mar 2024
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
David Wan
Jaemin Cho
Elias Stengel-Eskin
Mohit Bansal
VLM
ObjD
46
29
0
04 Mar 2024
RT-H: Action Hierarchies Using Language
Suneel Belkhale
Tianli Ding
Ted Xiao
P. Sermanet
Quan Vuong
Jonathan Tompson
Yevgen Chebotar
Debidatta Dwibedi
Dorsa Sadigh
LM&Ro
23
73
0
04 Mar 2024
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models
Lei Li
Yuqi Wang
Runxin Xu
Peiyi Wang
Xiachong Feng
Lingpeng Kong
Qi Liu
29
50
0
01 Mar 2024
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
Zekun Qi
Runpei Dong
Shaochen Zhang
Haoran Geng
Chunrui Han
Zheng Ge
Li Yi
Kaisheng Ma
33
49
0
27 Feb 2024
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts
Yusu Qian
Haotian Zhang
Yinfei Yang
Zhe Gan
64
26
0
20 Feb 2024
PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
Soroush Nasiriany
Fei Xia
Wenhao Yu
Ted Xiao
Jacky Liang
...
Karol Hausman
N. Heess
Chelsea Finn
Sergey Levine
Brian Ichter
LM&Ro
LRM
17
88
0
12 Feb 2024
InkSight: Offline-to-Online Handwriting Conversion by Learning to Read and Write
B. Mitrevski
Arina Rak
Julian Schnitzler
Chengkun Li
Andrii Maksai
Jesse Berent
C. Musat
DiffM
10
0
0
08 Feb 2024
Real-World Robot Applications of Foundation Models: A Review
Kento Kawaharazuka
T. Matsushima
Andrew Gambardella
Jiaxian Guo
Chris Paxton
Andy Zeng
OffRL
VLM
LM&Ro
38
45
0
08 Feb 2024
Scaling Up LLM Reviews for Google Ads Content Moderation
Wei Qiao
Tushar Dogra
Otilia Stretcu
Yu-Han Lyu
Tiantian Fang
...
Chih-Chun Chia
Ariel Fuxman
Fangzhou Wang
Ranjay Krishna
Mehmet Tek
13
11
0
07 Feb 2024
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Gilles Baechler
Srinivas Sunkara
Maria Wang
Fedir Zubach
Hassan Mansoor
Vincent Etter
Victor Carbune
Jason Lin
Jindong Chen
Abhanshu Sharma
112
47
0
07 Feb 2024
Time-, Memory- and Parameter-Efficient Visual Adaptation
Otniel-Bogdan Mercea
Alexey Gritsenko
Cordelia Schmid
Anurag Arnab
VLM
35
13
0
05 Feb 2024
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering
Ziyu Ma
Shutao Li
Bin Sun
Jianfei Cai
Zuxiang Long
Fuyan Ma
13
1
0
04 Feb 2024
VIALM: A Survey and Benchmark of Visually Impaired Assistance with Large Models
Yi Zhao
Yilin Zhang
Rong Xiang
Jing Li
Hillming Li
20
16
0
29 Jan 2024
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
Ryota Tanaka
Taichi Iki
Kyosuke Nishida
Kuniko Saito
Jun Suzuki
VLM
11
23
0
24 Jan 2024
Red Teaming Visual Language Models
Mukai Li
Lei Li
Yuwei Yin
Masood Ahmed
Zhenguang Liu
Qi Liu
VLM
25
30
0
23 Jan 2024
CLIP feature-based randomized control using images and text for multiple tasks and robots
Kazuki Shibata
Hideki Deguchi
Shun Taguchi
19
1
0
18 Jan 2024
Distilling Vision-Language Models on Millions of Videos
Yue Zhao
Long Zhao
Xingyi Zhou
Jialin Wu
Chun-Te Chu
...
Hartwig Adam
Ting Liu
Boqing Gong
Philipp Krahenbuhl
Liangzhe Yuan
VLM
19
13
0
11 Jan 2024
Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning
Yiqi Wang
Wentao Chen
Xiaotian Han
Xudong Lin
Haiteng Zhao
Yongfei Liu
Bohan Zhai
Jianbo Yuan
Quanzeng You
Hongxia Yang
LRM
33
66
0
10 Jan 2024
Language-Conditioned Robotic Manipulation with Fast and Slow Thinking
Minjie Zhu
Yichen Zhu
Jinming Li
Junjie Wen
Zhiyuan Xu
...
Chaomin Shen
Yaxin Peng
Dong Liu
Feifei Feng
Jian Tang
LM&Ro
16
15
0
08 Jan 2024
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
Aleksandar Stanić
Sergi Caelles
Michael Tschannen
LRM
VLM
21
9
0
03 Jan 2024
Instruct-Imagen: Image Generation with Multi-modal Instruction
Hexiang Hu
Kelvin C. K. Chan
Yu-Chuan Su
Wenhu Chen
Yandong Li
...
Xue Ben
Boqing Gong
William W. Cohen
Ming-Wei Chang
Xuhui Jia
MLLM
33
42
0
03 Jan 2024
GPT-4V(ision) is a Generalist Web Agent, if Grounded
Boyuan Zheng
Boyu Gou
Jihyung Kil
Huan Sun
Yu-Chuan Su
MLLM
VLM
LLMAG
35
79
0
03 Jan 2024
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Jiasen Lu
Christopher Clark
Sangho Lee
Zichen Zhang
Savya Khosla
Ryan Marten
Derek Hoiem
Aniruddha Kembhavi
VLM
MLLM
27
143
0
28 Dec 2023
MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices
Xiangxiang Chu
Limeng Qiao
Xinyang Lin
Shuang Xu
Yang Yang
...
Fei Wei
Xinyu Zhang
Bo-Wen Zhang
Xiaolin Wei
Chunhua Shen
MLLM
21
14
0
28 Dec 2023
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
135
895
0
21 Dec 2023
GeomVerse: A Systematic Evaluation of Large Models for Geometric Reasoning
Mehran Kazemi
Hamidreza Alvari
Ankit Anand
Jialin Wu
Xi Chen
Radu Soricut
LRM
ReLM
12
19
0
19 Dec 2023