ResearchTrend.AI

PaLI-X: On Scaling up a Multilingual Vision and Language Model (arXiv:2305.18565)

29 May 2023
Xi Chen
Josip Djolonga
Piotr Padlewski
Basil Mustafa
Soravit Changpinyo
Jialin Wu
Carlos Riquelme Ruiz
Sebastian Goodman
Tianlin Li
Yi Tay
Siamak Shakeri
Mostafa Dehghani
Daniel M. Salz
Mario Lucic
Michael Tschannen
Arsha Nagrani
Hexiang Hu
Mandar Joshi
Bo Pang
Ceslee Montgomery
Paulina Pietrzyk
Marvin Ritter
A. Piergiovanni
Matthias Minderer
Filip Pavetić
Austin Waters
Gang Li
Ibrahim Alabdulmohsin
Lucas Beyer
J. Amelot
Kenton Lee
Andreas Steiner
Yang Li
Daniel Keysers
Anurag Arnab
Yuanzhong Xu
Keran Rong
Alexander Kolesnikov
Mojtaba Seyedhosseini
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut

Papers citing "PaLI-X: On Scaling up a Multilingual Vision and Language Model"

Showing 50 of 101 citing papers.
BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval (NeurIPS 2024)
Imanol Miranda
Ander Salaberria
Eneko Agirre
Gorka Azkune
14 Jun 2024
ChatSR: Multimodal Large Language Models for Scientific Formula Discovery
Yanjie Li
Weijun Li
Lina Yu
Min Wu
Jingyi Liu
Wenqiang Li
Shu Wei
Yusong Deng
08 Jun 2024
Towards Semantic Equivalence of Tokenization in Multimodal LLM (ICLR 2024)
Shengqiong Wu
Hao Fei
Xiangtai Li
Jiayi Ji
Hanwang Zhang
Tat-Seng Chua
Shuicheng Yan
07 Jun 2024
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
26 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
23 May 2024
What matters when building vision-language models? (NeurIPS 2024)
Hugo Laurençon
Léo Tronchon
Matthieu Cord
Victor Sanh
03 May 2024
What Foundation Models can Bring for Robot Learning in Manipulation: A Survey
Dingzhe Li
Yixiang Jin
A. Yong
Yong A
Hongze Yu
...
Huaping Liu
Gang Hua
F. Sun
Jianwei Zhang
Bin Fang
28 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models (ECCV 2024)
Oğuzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
10 Apr 2024
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
Juhong Min
Shyamal Buch
Arsha Nagrani
Minsu Cho
Cordelia Schmid
09 Apr 2024
IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations
Deqing Fu
Ghazal Khalighinejad
Ollie Liu
Bhuwan Dhingra
Dani Yogatama
Robin Jia
Willie Neiswanger
01 Apr 2024
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
Kai Zhang
Yi Luan
Hexiang Hu
Kenton Lee
Siyuan Qiao
Wenhu Chen
Yu-Chuan Su
Ming-Wei Chang
28 Mar 2024
MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control
Enshen Zhou
Yiran Qin
Zhen-fei Yin
Yuzhou Huang
Ruimao Zhang
Lu Sheng
Yu Qiao
Jing Shao
18 Mar 2024
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan
Yousong Zhu
Hongyin Zhao
Fan Yang
Jinqiao Wang
14 Mar 2024
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
David Wan
Jaemin Cho
Elias Stengel-Eskin
Mohit Bansal
04 Mar 2024
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models
Lei Li
Yuqi Wang
Runxin Xu
Peiyi Wang
Xiachong Feng
Lingpeng Kong
Qi Liu
01 Mar 2024
Learning by Watching: A Review of Video-based Learning Approaches for Robot Manipulation (IEEE Access 2024)
Chrisantus Eze
Christopher Crick
11 Feb 2024
InkSight: Offline-to-Online Handwriting Conversion by Teaching Vision-Language Models to Read and Write
B. Mitrevski
Arina Rak
Julian Schnitzler
Chengkun Li
Andrii Maksai
Jesse Berent
C. Musat
08 Feb 2024
Scaling Up LLM Reviews for Google Ads Content Moderation
Wei Qiao
Tushar Dogra
Otilia Stretcu
Yu-Han Lyu
Tiantian Fang
...
Chih-Chun Chia
Ariel Fuxman
Fangzhou Wang
Ranjay Krishna
Mehmet Tek
07 Feb 2024
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Gilles Baechler
Srinivas Sunkara
Maria Wang
Fedir Zubach
Hassan Mansoor
Vincent Etter
Victor Carbune
Jason Lin
Jindong Chen
Abhanshu Sharma
07 Feb 2024
Time-, Memory- and Parameter-Efficient Visual Adaptation (CVPR 2024)
Otniel-Bogdan Mercea
Alexey Gritsenko
Cordelia Schmid
Anurag Arnab
05 Feb 2024
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering
Ziyu Ma
Shutao Li
Bin Sun
Jianfei Cai
Zuxiang Long
Fuyan Ma
04 Feb 2024
VIALM: A Survey and Benchmark of Visually Impaired Assistance with Large Models
Yi Zhao
Yilin Zhang
Rong Xiang
Jing Li
Hillming Li
29 Jan 2024
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions (AAAI 2024)
Ryota Tanaka
Taichi Iki
Kyosuke Nishida
Kuniko Saito
Jun Suzuki
24 Jan 2024
CLIP feature-based randomized control using images and text for multiple tasks and robots
Kazuki Shibata
Hideki Deguchi
Shun Taguchi
18 Jan 2024
Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning
Yiqi Wang
Wentao Chen
Xiaotian Han
Xudong Lin
Haiteng Zhao
Yongfei Liu
Bohan Zhai
Jianbo Yuan
Quanzeng You
Hongxia Yang
10 Jan 2024
Language-Conditioned Robotic Manipulation with Fast and Slow Thinking (ICRA 2024)
Minjie Zhu
Yichen Zhu
Jinming Li
Junjie Wen
Zhiyuan Xu
...
Yaxin Peng
Chaomin Shen
Dong Liu
Feifei Feng
Jian Tang
08 Jan 2024
GPT-4V(ision) is a Generalist Web Agent, if Grounded (ICML 2024)
Boyuan Zheng
Boyu Gou
Jihyung Kil
Huan Sun
Yu-Chuan Su
03 Jan 2024
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects
Junyu Lu
Ruyi Gan
Di Zhang
Xiaojun Wu
Ziwei Wu
Renliang Sun
Jiaxing Zhang
Pingjian Zhang
Yan Song
08 Dec 2023
Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future
Guoying Gu
Yang Li
Huijie Wang
Jia Zeng
Huilin Xu
...
Kai Yan
Beipeng Mu
Zhihui Peng
Shaoqing Ren
Yu Qiao
06 Dec 2023
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment (ECCV 2023)
Brian Gordon
Yonatan Bitton
Yonatan Shafir
Roopal Garg
Xi Chen
Dani Lischinski
Daniel Cohen-Or
Idan Szpektor
05 Dec 2023
SARA-RT: Scaling up Robotics Transformers with Self-Adaptive Robust Attention (ICRA 2023)
Isabel Leal
Krzysztof Choromanski
Deepali Jain
Kumar Avinava Dubey
Jake Varley
...
Q. Vuong
Tamás Sarlós
Kenneth Oslund
Karol Hausman
Kanishka Rao
04 Dec 2023
Leveraging VLM-Based Pipelines to Annotate 3D Objects (ICML 2023)
Rishabh Kabra
Loic Matthey
Alexander Lerchner
Niloy J. Mitra
29 Nov 2023
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (CVPR 2023)
Xiang Yue
Yuansheng Ni
Kai Zhang
Tianyu Zheng
Ruoqi Liu
...
Yibo Liu
Wenhao Huang
Huan Sun
Yu-Chuan Su
Wenhu Chen
27 Nov 2023
EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension (CVPR 2023)
Jiaxuan Li
D. Vo
Akihiro Sugimoto
Hideki Nakayama
27 Nov 2023
Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
Zhuosheng Zhang
Yao Yao
Aston Zhang
Xiangru Tang
Xinbei Ma
...
Yiming Wang
Mark B. Gerstein
Rui Wang
Gongshen Liu
Hai Zhao
20 Nov 2023
Large Language Models for Robotics: A Survey
Fanlong Zeng
Wensheng Gan
Zezheng Huai
Lichao Sun
Hechang Chen
Yongheng Wang
Ning Liu
Philip S. Yu
13 Nov 2023
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (CVPR 2023)
Bin Xiao
Haiping Wu
Weijian Xu
Xiyang Dai
Houdong Hu
Yumao Lu
Michael Zeng
Ce Liu
Lu Yuan
10 Nov 2023
OtterHD: A High-Resolution Multi-modality Model
Yue Liu
Peiyuan Zhang
Jingkang Yang
Yuanhan Zhang
Fanyi Pu
Ziwei Liu
07 Nov 2023
CogVLM: Visual Expert for Pretrained Language Models (NeurIPS 2023)
Weihan Wang
Qingsong Lv
Wenmeng Yu
Wenyi Hong
Ji Qi
...
Bin Xu
Juanzi Li
Yuxiao Dong
Ming Ding
Jie Tang
06 Nov 2023
De-Diffusion Makes Text a Strong Cross-Modal Interface (CVPR 2023)
Chen Wei
Chenxi Liu
Siyuan Qiao
Zhishuai Zhang
Alan Yuille
Jiahui Yu
01 Nov 2023
Advances in Embodied Navigation Using Large Language Models: A Survey
Jinzhou Lin
Han Gao
Xuxiang Feng
Rongtao Xu
Changwei Wang
Man Zhang
Li Guo
Shibiao Xu
01 Nov 2023
DOMINO: A Dual-System for Multi-step Visual Language Reasoning
Peifang Wang
O. Yu. Golovneva
Armen Aghajanyan
Xiang Ren
Muhao Chen
Asli Celikyilmaz
Maryam Fazel-Zarandi
04 Oct 2023
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning (ICLR 2023)
Mustafa Shukor
Alexandre Ramé
Corentin Dancette
Matthieu Cord
01 Oct 2023
CausalLM is not optimal for in-context learning (ICLR 2023)
Nan Ding
Tomer Levinboim
Jialin Wu
Sebastian Goodman
Radu Soricut
14 Aug 2023
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (CoRL 2023)
Anthony Brohan
Noah Brown
Justice Carbajal
Yevgen Chebotar
Xi Chen
...
Ted Xiao
Peng Xu
Sichun Xu
Tianhe Yu
Brianna Zitkovich
28 Jul 2023
Emu: Generative Pretraining in Multimodality (ICLR 2023)
Quan-Sen Sun
Qiying Yu
Yufeng Cui
Fan Zhang
Xiaosong Zhang
Yueze Wang
Hongcheng Gao
Jingjing Liu
Tiejun Huang
Xinlong Wang
11 Jul 2023
Dense Video Object Captioning from Disjoint Supervision (ICLR 2023)
Xingyi Zhou
Anurag Arnab
Chen Sun
Cordelia Schmid
20 Jun 2023
Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining (EMNLP 2023)
Emanuele Bugliarello
Aida Nematzadeh
Lisa Anne Hendricks
23 May 2023
Otter: A Multi-Modal Model with In-Context Instruction Tuning (TPAMI 2023)
Yue Liu
Yuanhan Zhang
Liangyu Chen
Jinghao Wang
Fanyi Pu
Joshua Adrian Cahyono
Jingkang Yang
Yu Qiao
05 May 2023
Subject-driven Text-to-Image Generation via Apprenticeship Learning (NeurIPS 2023)
Wenhu Chen
Hexiang Hu
Yandong Li
Nataniel Rui
Xuhui Jia
Ming-Wei Chang
William W. Cohen
01 Apr 2023