PaLI-X: On Scaling up a Multilingual Vision and Language Model

29 May 2023
Xi Chen
Josip Djolonga
Piotr Padlewski
Basil Mustafa
Soravit Changpinyo
Jialin Wu
Carlos Riquelme Ruiz
Sebastian Goodman
Tianlin Li
Yi Tay
Siamak Shakeri
Mostafa Dehghani
Daniel M. Salz
Mario Lucic
Michael Tschannen
Arsha Nagrani
Hexiang Hu
Mandar Joshi
Bo Pang
Ceslee Montgomery
Paulina Pietrzyk
Marvin Ritter
A. Piergiovanni
Matthias Minderer
Filip Pavetić
Austin Waters
Gang Li
Ibrahim Alabdulmohsin
Lucas Beyer
J. Amelot
Kenton Lee
Andreas Steiner
Yang Li
Daniel Keysers
Anurag Arnab
Yuanzhong Xu
Keran Rong
Alexander Kolesnikov
Mojtaba Seyedhosseini
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut
    VLM
arXiv: 2305.18565

Papers citing "PaLI-X: On Scaling up a Multilingual Vision and Language Model"

50 / 101 papers shown
Sigma: The Key for Vision-Language-Action Models toward Telepathic Alignment
Libo Wang
114
0
0
30 Nov 2025
Co-Training Vision Language Models for Remote Sensing Multi-task Learning
Qingyun Li
Shuran Ma
Junwei Luo
Yi Yu
Yue Zhou
...
Xudong Lu
Xiaoxing Wang
Xin He
Yushi Chen
Xue Yang
179
1
0
26 Nov 2025
CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding
Yuefei Chen
Jiang Liu
Xiaodong Lin
Ruixiang Tang
LRM
215
0
0
25 Nov 2025
ViPRA: Video Prediction for Robot Actions
Sandeep Routray
Hengkai Pan
Unnat Jain
Shikhar Bahl
Deepak Pathak
231
2
0
11 Nov 2025
The Impact of Image Resolution on Biomedical Multimodal Large Language Models
Liangyu Chen
James Burgess
Jeffrey Nirschl
Orr Zohar
Serena Yeung-Levy
99
0
0
21 Oct 2025
When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs
Hongcheng Liu
Pingjie Wang
Yuhao Wang
Siqu Ou
Yanfeng Wang
Y Samuel Wang
LRM
156
0
0
17 Oct 2025
Prompt-Guided Spatial Understanding with RGB-D Transformers for Fine-Grained Object Relation Reasoning
Tanner Muturi
Blessing Agyei Kyem
Joshua Kofi Asamoah
Neema Jakisa Owor
Richard Dyzinela
Andrews Danyo
Y. Adu-Gyamfi
Armstrong Aboah
LRM
122
3
0
13 Oct 2025
Goal-oriented Backdoor Attack against Vision-Language-Action Models via Physical Objects
Zirun Zhou
Zhengyang Xiao
Haochuan Xu
Jing Sun
Di Wang
Jingfeng Zhang
AAML
134
1
0
10 Oct 2025
HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy
Myungkyu Koo
Daewon Choi
Taeyoung Kim
Kyungmin Lee
Changyeon Kim
Younggyo Seo
Jinwoo Shin
LM&Ro
VLM
175
0
0
01 Oct 2025
POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency
Ashim Dahal
Ankit Ghimire
Saydul Akbar Murad
Nick Rahimi
144
0
0
01 Oct 2025
Multilingual Vision-Language Models, A Survey
Andrei-Alexandru Manea
Jindřich Libovický
VLM
143
1
0
26 Sep 2025
Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance
Yuxuan Liang
Xu Li
Xiaolei Chen
Yi Zheng
Haotian Chen
Bin Li
Xiangyang Xue
VLM
152
0
0
19 Sep 2025
Self-Improving Embodied Foundation Models
Seyed Kamyar Seyed Ghasemipour
Ayzaan Wahid
Jonathan Tompson
Pannag R Sanketi
Igor Mordatch
LM&Ro
LRM
148
6
0
18 Sep 2025
Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance
Y. Zhang
C. Wang
Ouyang Lu
Yuan Zhao
Yunfei Ge
Zhenglong Sun
Xiu Li
Chi Zhang
Chenjia Bai
Xuelong Li
221
6
0
02 Sep 2025
OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
Yanqing Liu
Xianhang Li
Letian Zhang
Zirui Wang
Zeyu Zheng
Yuyin Zhou
Cihang Xie
VLM
201
2
0
01 Sep 2025
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
Rui Shao
W. Li
Lingsen Zhang
Renshan Zhang
Zhiyang Liu
Ran Chen
Liqiang Nie
LM&Ro
247
27
0
18 Aug 2025
FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation
Cui Miao
Tao Chang
Meihan Wu
Hongbin Xu
Chun Li
Ming Li
Xiaodong Wang
FedML
141
5
0
04 Aug 2025
GR-3 Technical Report
Chilam Cheang
S. Chen
Zhongren Cui
Yingdong Hu
Liqun Huang
...
Hongtao Wu
Xin Xiao
Yuyang Xiao
Jiafeng Xu
Yichu Yang
320
46
0
21 Jul 2025
CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling
Hao Li
Shuai Yang
Yilun Chen
Xinyi Chen
Xiaoda Yang
...
Hanqing Wang
Tai Wang
Dahua Lin
Feng Zhao
Jiangmiao Pang
200
6
0
24 Jun 2025
Adapting Vision-Language Models for Evaluating World Models
Mariya Hendriksen
Tabish Rashid
David Bignell
Raluca Georgescu
Abdelhak Lemkhenter
Katja Hofmann
Sam Devlin
Sarah Parisot
189
0
0
22 Jun 2025
Vision Generalist Model: A Survey
International Journal of Computer Vision (IJCV), 2025
Ziyi Wang
Yongming Rao
Shuofeng Sun
Xinrun Liu
Yi Wei
...
Zuyan Liu
Yanbo Wang
Hongmin Liu
Jie Zhou
Jiwen Lu
293
0
0
11 Jun 2025
Sensory-Motor Control with Large Language Models via Iterative Policy Refinement
J. Carvalho
S. Nolfi
LM&Ro
362
0
0
05 Jun 2025
SEM: Enhancing Spatial Understanding for Robust Robot Manipulation
Xuewu Lin
Tianwei Lin
Lichao Huang
Hongyu Xie
Yiwei Jin
Keyu Li
Zhizhong Su
302
3
0
22 May 2025
Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs
Hao Wang
Pinzhi Huang
Jihan Yang
Saining Xie
Daisuke Kawahara
474
1
0
21 May 2025
Behind Maya: Building a Multilingual Vision Language Model
Nahid Alam
Karthik Reddy Kanjula
Surya Guthikonda
Timothy Chung
Bala Krishna S Vegesna
...
Isha Chaturvedi
Genta Indra Winata
Ashvanth.S
Snehanshu Mukherjee
Alham Fikri Aji
MLLM
VLM
301
2
0
13 May 2025
A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation
Rongtao Xu
Junxuan Zhang
Minghao Guo
Youpeng Wen
H. Yang
...
Liqiong Wang
Yuxuan Kuang
Meng Cao
Feng Zheng
Xiaodan Liang
620
29
0
17 Apr 2025
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Information Fusion (Inf. Fusion), 2025
Xiaofeng Han
Shunpeng Chen
Zenghuang Fu
Zhe Feng
Lue Fan
...
Li Guo
Weiliang Meng
Xiaopeng Zhang
Rongtao Xu
Shibiao Xu
439
37
0
03 Apr 2025
RoboFlamingo-Plus: Fusion of Depth and RGB Perception with Vision-Language Models for Enhanced Robotic Manipulation
International Conference on Real-time Computing and Robotics (ICRCR), 2025
Sheng Wang
VLM
337
10
0
25 Mar 2025
Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning
Junming Liu
Siyuan Meng
Yanting Gao
Song Mao
Pinlong Cai
Guohang Yan
Yirong Chen
Zilin Bian
Ding Wang
Botian Shi
366
12
0
17 Mar 2025
Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation
Qingxuan Jia
Guoqin Tang
Zeyuan Huang
Zixuan Hao
Ning Ji
Shihang
Gang Chen
219
0
0
07 Mar 2025
Generative Artificial Intelligence in Robotic Manipulation: A Survey
Kun Zhang
Peng Yun
Jun Cen
Junhao Cai
DiDi Zhu
...
Qifeng Chen
Jia Pan
Wei Zhang
Bo Yang
Hua Chen
661
14
0
05 Mar 2025
A Token-level Text Image Foundation Model for Document Understanding
Tongkun Guan
Zining Wang
Pei Fu
Zhengtao Guo
Wei Shen
...
Chen Duan
Hao Sun
Qianyi Jiang
Junfeng Luo
Yunbo Wang
VLM
604
4
0
04 Mar 2025
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
Yunhai Feng
Jiaming Han
Zhiyong Yang
Xiangyu Yue
Sergey Levine
Jianlan Luo
LM&Ro
371
25
0
23 Feb 2025
A Comprehensive Survey on Composed Image Retrieval
Xuemeng Song
Haoqiang Lin
Haokun Wen
Bohan Hou
Mingzhu Xu
Liqiang Nie
479
7
0
19 Feb 2025
Unhackable Temporal Rewarding for Scalable Video MLLMs
En Yu
Kangheng Lin
Liang Zhao
Yana Wei
Zining Zhu
...
Jianjian Sun
Zheng Ge
Xinsong Zhang
Jingyu Wang
Wenbing Tao
286
22
0
17 Feb 2025
Scalable, Training-Free Visual Language Robotics: A Modular Multi-Model Framework for Consumer-Grade GPUs
IEEE/SICE International Symposium on System Integration (SII), 2025
Marie Samson
Bastien Muraccioli
Fumio Kanehiro
517
4
0
03 Feb 2025
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Pan Zhang
Xiaoyi Dong
Yuhang Cao
Yuhang Zang
Rui Qian
...
Xinsong Zhang
Kai Chen
Yu Qiao
Dahua Lin
Jiaqi Wang
KELM
371
34
0
12 Dec 2024
Neptune: The Long Orbit to Benchmarking Long Video Understanding
Arsha Nagrani
Ruotong Wang
Ramin Mehran
Rachel Hornung
N. B. Gundavarapu
...
Boqing Gong
Cordelia Schmid
Mikhail Sirotenko
Yukun Zhu
Tobias Weyand
446
16
0
12 Dec 2024
DocVLM: Make Your VLM an Efficient Reader
Computer Vision and Pattern Recognition (CVPR), 2024
Mor Shpigel Nacson
Aviad Aberdam
Roy Ganz
Elad Ben Avraham
Alona Golts
Yair Kittenplon
Shai Mazor
Ron Litman
VLM
650
0
0
11 Dec 2024
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
Qixiu Li
Yaobo Liang
Zeyu Wang
Lin Luo
Xi Chen
...
Jianmin Bao
Dong Chen
Yuanchun Shi
Jiaolong Yang
B. Guo
LM&Ro
354
187
0
29 Nov 2024
Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads
Siqi Kou
Jiachun Jin
Chang Liu
Ye Ma
Jian Jia
Quan Chen
Peng Jiang
Zhijie Deng
DiffM
VGen
VLM
607
28
0
28 Nov 2024
Heuristic-Free Multi-Teacher Learning
Huy Thong Nguyen
En-Hung Chu
Lenord Melvix
Jazon Jiao
Chunglin Wen
Benjamin Louie
360
0
0
19 Nov 2024
EMMA: End-to-End Multimodal Model for Autonomous Driving
Jyh-Jing Hwang
Runsheng Xu
Hubert Lin
Wei-Chih Hung
Jingwei Ji
...
Benjamin Sapp
Yin Zhou
James Guo
Dragomir Anguelov
Mingxing Tan
VLM
LM&Ro
433
116
0
30 Oct 2024
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Long Xing
Qidong Huang
Xiaoyi Dong
Jiajie Lu
Pan Zhang
...
Yuhang Cao
Bin Wang
Jiaqi Wang
Feng Wu
Dahua Lin
VLM
331
133
0
22 Oct 2024
ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models
IEEE International Conference on Robotics and Automation (ICRA), 2024
Sombit Dey
Jan-Nico Zaech
Nikolay Nikolov
Luc Van Gool
Danda Pani Paudel
MoMe
VLM
470
16
0
23 Sep 2024
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi
Fuxiao Liu
Shihao Wang
Shijia Liao
Subhashree Radhakrishnan
...
Andrew Tao
Zhiding Yu
Guilin Liu
MLLM
402
115
0
28 Aug 2024
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Le Xue
Manli Shu
Anas Awadalla
Jun Wang
An Yan
...
Zeyuan Chen
Silvio Savarese
Juan Carlos Niebles
Caiming Xiong
Ran Xu
VLM
526
141
0
16 Aug 2024
VideoQA in the Era of LLMs: An Empirical Study
International Journal of Computer Vision (IJCV), 2024
Junbin Xiao
Nanxin Huang
Hangyu Qin
Dongyang Li
Yicong Li
...
Zhulin Tao
Jianxing Yu
Liang Lin
Tat-Seng Chua
Angela Yao
350
24
0
08 Aug 2024
VL-TGS: Trajectory Generation and Selection using Vision Language Models in Mapless Outdoor Environments
IEEE Robotics and Automation Letters (RA-L), 2024
Daeun Song
Jing Liang
Xuesu Xiao
Dinesh Manocha
572
19
0
05 Aug 2024
On Pre-training of Multimodal Language Models Customized for Chart Understanding
Wan-Cyuan Fan
Yen-Chun Chen
Xiyang Dai
Lu Yuan
Leonid Sigal
360
10
0
19 Jul 2024