arXiv:2305.18565
PaLI-X: On Scaling up a Multilingual Vision and Language Model
29 May 2023
Xi Chen
Josip Djolonga
Piotr Padlewski
Basil Mustafa
Soravit Changpinyo
Jialin Wu
Carlos Riquelme Ruiz
Sebastian Goodman
Tianlin Li
Yi Tay
Siamak Shakeri
Mostafa Dehghani
Daniel M. Salz
Mario Lucic
Michael Tschannen
Arsha Nagrani
Hexiang Hu
Mandar Joshi
Bo Pang
Ceslee Montgomery
Paulina Pietrzyk
Marvin Ritter
A. Piergiovanni
Matthias Minderer
Filip Pavetić
Austin Waters
Gang Li
Ibrahim Alabdulmohsin
Lucas Beyer
J. Amelot
Kenton Lee
Andreas Steiner
Yang Li
Daniel Keysers
Anurag Arnab
Yuanzhong Xu
Keran Rong
Alexander Kolesnikov
Mojtaba Seyedhosseini
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut
VLM
Papers citing
"PaLI-X: On Scaling up a Multilingual Vision and Language Model"
50 / 101 papers shown
BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval
Neural Information Processing Systems (NeurIPS), 2024
Imanol Miranda
Ander Salaberria
Eneko Agirre
Gorka Azkune
CoGe
242
5
0
14 Jun 2024
ChatSR: Multimodal Large Language Models for Scientific Formula Discovery
Yanjie Li
Weijun Li
Lina Yu
Min Wu
Jingyi Liu
Wenqiang Li
Shu Wei
Yusong Deng
OffRL
345
2
0
08 Jun 2024
Towards Semantic Equivalence of Tokenization in Multimodal LLM
International Conference on Learning Representations (ICLR), 2024
Shengqiong Wu
Hao Fei
Xiangtai Li
Jiayi Ji
Hanwang Zhang
Tat-Seng Chua
Shuicheng Yan
MLLM
574
57
0
07 Jun 2024
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
383
64
0
26 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
885
166
0
23 May 2024
What matters when building vision-language models?
Neural Information Processing Systems (NeurIPS), 2024
Hugo Laurençon
Léo Tronchon
Matthieu Cord
Victor Sanh
VLM
302
274
0
03 May 2024
What Foundation Models can Bring for Robot Learning in Manipulation : A Survey
Dingzhe Li
Yixiang Jin
Yong A
Hongze Yu
...
Huaping Liu
Gang Hua
F. Sun
Jianwei Zhang
Bin Fang
AI4CE
LM&Ro
900
26
0
28 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models
European Conference on Computer Vision (ECCV), 2024
Oğuzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM
VLM
297
57
0
10 Apr 2024
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
Juhong Min
Shyamal Buch
Arsha Nagrani
Minsu Cho
Cordelia Schmid
LRM
418
63
0
09 Apr 2024
IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations
Deqing Fu
Ghazal Khalighinejad
Ollie Liu
Bhuwan Dhingra
Dani Yogatama
Robin Jia
Willie Neiswanger
452
36
0
01 Apr 2024
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
Kai Zhang
Yi Luan
Hexiang Hu
Kenton Lee
Siyuan Qiao
Wenhu Chen
Yu-Chuan Su
Ming-Wei Chang
VLM
LRM
297
73
0
28 Mar 2024
MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control
Enshen Zhou
Yiran Qin
Zhen-fei Yin
Yuzhou Huang
Ruimao Zhang
Lu Sheng
Yu Qiao
Jing Shao
LM&Ro
AI4CE
296
49
0
18 Mar 2024
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan
Yousong Zhu
Hongyin Zhao
Fan Yang
Jinqiao Wang
ObjD
294
26
0
14 Mar 2024
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
David Wan
Jaemin Cho
Elias Stengel-Eskin
Mohit Bansal
VLM
ObjD
316
49
0
04 Mar 2024
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models
Lei Li
Yuqi Wang
Runxin Xu
Peiyi Wang
Xiachong Feng
Lingpeng Kong
Qi Liu
358
96
0
01 Mar 2024
Learning by Watching: A Review of Video-based Learning Approaches for Robot Manipulation
IEEE Access, 2024
Chrisantus Eze
Christopher Crick
SSL
466
16
0
11 Feb 2024
InkSight: Offline-to-Online Handwriting Conversion by Teaching Vision-Language Models to Read and Write
B. Mitrevski
Arina Rak
Julian Schnitzler
Chengkun Li
Andrii Maksai
Jesse Berent
C. Musat
DiffM
331
0
0
08 Feb 2024
Scaling Up LLM Reviews for Google Ads Content Moderation
Wei Qiao
Tushar Dogra
Otilia Stretcu
Yu-Han Lyu
Tiantian Fang
...
Chih-Chun Chia
Ariel Fuxman
Fangzhou Wang
Ranjay Krishna
Mehmet Tek
182
22
0
07 Feb 2024
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Gilles Baechler
Srinivas Sunkara
Maria Wang
Fedir Zubach
Hassan Mansoor
Vincent Etter
Victor Carbune
Jason Lin
Jindong Chen
Abhanshu Sharma
846
96
0
07 Feb 2024
Time-, Memory- and Parameter-Efficient Visual Adaptation
Computer Vision and Pattern Recognition (CVPR), 2024
Otniel-Bogdan Mercea
Alexey Gritsenko
Cordelia Schmid
Anurag Arnab
VLM
191
22
0
05 Feb 2024
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering
Ziyu Ma
Shutao Li
Bin Sun
Jianfei Cai
Zuxiang Long
Fuyan Ma
259
8
0
04 Feb 2024
VIALM: A Survey and Benchmark of Visually Impaired Assistance with Large Models
Yi Zhao
Yilin Zhang
Rong Xiang
Jing Li
Hillming Li
333
26
0
29 Jan 2024
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
AAAI Conference on Artificial Intelligence (AAAI), 2024
Ryota Tanaka
Taichi Iki
Kyosuke Nishida
Kuniko Saito
Jun Suzuki
VLM
258
35
0
24 Jan 2024
CLIP feature-based randomized control using images and text for multiple tasks and robots
Kazuki Shibata
Hideki Deguchi
Shun Taguchi
276
3
0
18 Jan 2024
Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning
Yiqi Wang
Wentao Chen
Xiaotian Han
Xudong Lin
Haiteng Zhao
Yongfei Liu
Bohan Zhai
Jianbo Yuan
Quanzeng You
Hongxia Yang
LRM
308
146
0
10 Jan 2024
Language-Conditioned Robotic Manipulation with Fast and Slow Thinking
IEEE International Conference on Robotics and Automation (ICRA), 2024
Minjie Zhu
Yichen Zhu
Jinming Li
Junjie Wen
Zhiyuan Xu
...
Yaxin Peng
Chaomin Shen
Dong Liu
Feifei Feng
Jian Tang
LM&Ro
228
26
0
08 Jan 2024
GPT-4V(ision) is a Generalist Web Agent, if Grounded
International Conference on Machine Learning (ICML), 2024
Boyuan Zheng
Boyu Gou
Jihyung Kil
Huan Sun
Yu-Chuan Su
MLLM
VLM
LLMAG
381
403
0
03 Jan 2024
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects
Junyu Lu
Ruyi Gan
Di Zhang
Xiaojun Wu
Ziwei Wu
Renliang Sun
Jiaxing Zhang
Pingjian Zhang
Yan Song
MLLM
VLM
225
22
0
08 Dec 2023
Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future
Guoying Gu
Yang Li
Huijie Wang
Jia Zeng
Huilin Xu
...
Kai Yan
Beipeng Mu
Zhihui Peng
Shaoqing Ren
Yu Qiao
338
30
0
06 Dec 2023
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment
European Conference on Computer Vision (ECCV), 2024
Brian Gordon
Yonatan Bitton
Yonatan Shafir
Roopal Garg
Xi Chen
Dani Lischinski
Daniel Cohen-Or
Idan Szpektor
240
17
0
05 Dec 2023
SARA-RT: Scaling up Robotics Transformers with Self-Adaptive Robust Attention
IEEE International Conference on Robotics and Automation (ICRA), 2024
Isabel Leal
Krzysztof Choromanski
Deepali Jain
Kumar Avinava Dubey
Jake Varley
...
Q. Vuong
Tamás Sarlós
Kenneth Oslund
Karol Hausman
Kanishka Rao
219
20
0
04 Dec 2023
Leveraging VLM-Based Pipelines to Annotate 3D Objects
International Conference on Machine Learning (ICML), 2023
Rishabh Kabra
Loic Matthey
Alexander Lerchner
Niloy J. Mitra
274
10
0
29 Nov 2023
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Computer Vision and Pattern Recognition (CVPR), 2024
Xiang Yue
Yuansheng Ni
Kai Zhang
Tianyu Zheng
Ruoqi Liu
...
Yibo Liu
Wenhao Huang
Huan Sun
Yu-Chuan Su
Wenhu Chen
OSLM
ELM
VLM
852
1,620
0
27 Nov 2023
EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension
Computer Vision and Pattern Recognition (CVPR), 2023
Jiaxuan Li
D. Vo
Akihiro Sugimoto
Hideki Nakayama
KELM
VLM
261
43
0
27 Nov 2023
Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
Zhuosheng Zhang
Yao Yao
Aston Zhang
Xiangru Tang
Xinbei Ma
...
Yiming Wang
Mark B. Gerstein
Rui Wang
Gongshen Liu
Hai Zhao
LLMAG
LM&Ro
LRM
363
92
0
20 Nov 2023
Large Language Models for Robotics: A Survey
Fanlong Zeng
Wensheng Gan
Zezheng Huai
Lichao Sun
Hechang Chen
Yongheng Wang
Ning Liu
Philip S. Yu
LM&Ro
373
201
0
13 Nov 2023
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Computer Vision and Pattern Recognition (CVPR), 2024
Bin Xiao
Haiping Wu
Weijian Xu
Xiyang Dai
Houdong Hu
Yumao Lu
Michael Zeng
Ce Liu
Lu Yuan
VLM
395
383
0
10 Nov 2023
OtterHD: A High-Resolution Multi-modality Model
Yue Liu
Peiyuan Zhang
Jingkang Yang
Yuanhan Zhang
Fanyi Pu
Ziwei Liu
VLM
MLLM
187
76
0
07 Nov 2023
CogVLM: Visual Expert for Pretrained Language Models
Neural Information Processing Systems (NeurIPS), 2024
Weihan Wang
Qingsong Lv
Wenmeng Yu
Wenyi Hong
Ji Qi
...
Bin Xu
Juanzi Li
Yuxiao Dong
Ming Ding
Jie Tang
VLM
MLLM
667
709
0
06 Nov 2023
De-Diffusion Makes Text a Strong Cross-Modal Interface
Computer Vision and Pattern Recognition (CVPR), 2023
Chen Wei
Chenxi Liu
Siyuan Qiao
Zhishuai Zhang
Alan Yuille
Jiahui Yu
VLM
DiffM
270
17
0
01 Nov 2023
Advances in Embodied Navigation Using Large Language Models: A Survey
Jinzhou Lin
Han Gao
Xuxiang Feng
Rongtao Xu
Changwei Wang
Man Zhang
Li Guo
Shibiao Xu
LM&Ro
LLMAG
759
21
0
01 Nov 2023
DOMINO: A Dual-System for Multi-step Visual Language Reasoning
Peifang Wang
O. Yu. Golovneva
Armen Aghajanyan
Xiang Ren
Muhao Chen
Asli Celikyilmaz
Maryam Fazel-Zarandi
LRM
173
13
0
04 Oct 2023
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning
International Conference on Learning Representations (ICLR), 2023
Mustafa Shukor
Alexandre Ramé
Corentin Dancette
Matthieu Cord
LRM
MLLM
428
26
0
01 Oct 2023
CausalLM is not optimal for in-context learning
International Conference on Learning Representations (ICLR), 2023
Nan Ding
Tomer Levinboim
Jialin Wu
Sebastian Goodman
Radu Soricut
205
31
0
14 Aug 2023
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Conference on Robot Learning (CoRL), 2023
Anthony Brohan
Noah Brown
Justice Carbajal
Yevgen Chebotar
Xi Chen
...
Ted Xiao
Peng Xu
Sichun Xu
Tianhe Yu
Brianna Zitkovich
LM&Ro
LRM
578
2,155
0
28 Jul 2023
Emu: Generative Pretraining in Multimodality
International Conference on Learning Representations (ICLR), 2024
Quan-Sen Sun
Qiying Yu
Yufeng Cui
Fan Zhang
Xiaosong Zhang
Yueze Wang
Hongcheng Gao
Jingjing Liu
Tiejun Huang
Xinlong Wang
MLLM
359
155
0
11 Jul 2023
Dense Video Object Captioning from Disjoint Supervision
International Conference on Learning Representations (ICLR), 2023
Xingyi Zhou
Anurag Arnab
Chen Sun
Cordelia Schmid
286
7
0
20 Jun 2023
Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Emanuele Bugliarello
Aida Nematzadeh
Lisa Anne Hendricks
SSL
295
6
0
23 May 2023
Otter: A Multi-Modal Model with In-Context Instruction Tuning
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Yue Liu
Yuanhan Zhang
Liangyu Chen
Jinghao Wang
Fanyi Pu
Joshua Adrian Cahyono
Jingkang Yang
Yu Qiao
MLLM
515
620
0
05 May 2023
Subject-driven Text-to-Image Generation via Apprenticeship Learning
Neural Information Processing Systems (NeurIPS), 2023
Wenhu Chen
Hexiang Hu
Yandong Li
Nataniel Ruiz
Xuhui Jia
Ming-Wei Chang
William W. Cohen
DiffM
919
227
0
01 Apr 2023