PaLI-X: On Scaling up a Multilingual Vision and Language Model
arXiv: 2305.18565

29 May 2023
Xi Chen
Josip Djolonga
Piotr Padlewski
Basil Mustafa
Soravit Changpinyo
Jialin Wu
Carlos Riquelme Ruiz
Sebastian Goodman
Xiao Wang
Yi Tay
Siamak Shakeri
Mostafa Dehghani
Daniel M. Salz
Mario Lucic
Michael Tschannen
Arsha Nagrani
Hexiang Hu
Mandar Joshi
Bo Pang
Ceslee Montgomery
Paulina Pietrzyk
Marvin Ritter
A. Piergiovanni
Matthias Minderer
Filip Pavetić
Austin Waters
Gang Li
Ibrahim M. Alabdulmohsin
Lucas Beyer
J. Amelot
Kenton Lee
Andreas Steiner
Yang Li
Daniel Keysers
Anurag Arnab
Yuanzhong Xu
Keran Rong
Alexander Kolesnikov
Mojtaba Seyedhosseini
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut
    VLM

Papers citing "PaLI-X: On Scaling up a Multilingual Vision and Language Model"

50 / 161 papers shown
Understanding and Mitigating Toxicity in Image-Text Pretraining Datasets: A Case Study on LLaVA
Karthik Reddy Kanjula
Surya Guthikonda
Nahid Alam
Shayekh Bin Islam
14
0
0
09 May 2025
A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation
Rongtao Xu
J. Zhang
Minghao Guo
Youpeng Wen
H. Yang
...
Liqiong Wang
Yuxuan Kuang
Meng Cao
Feng Zheng
Xiaodan Liang
37
2
0
17 Apr 2025
Multimodal LLM Augmented Reasoning for Interpretable Visual Perception Analysis
Shravan Chaudhari
Trilokya Akula
Yoon Kim
Tom Blake
LRM
40
0
0
16 Apr 2025
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Xiaofeng Han
Shunpeng Chen
Zenghuang Fu
Zhe Feng
Lue Fan
...
Li Guo
Weiliang Meng
Xiaopeng Zhang
Rongtao Xu
Shibiao Xu
60
0
0
03 Apr 2025
RoboFlamingo-Plus: Fusion of Depth and RGB Perception with Vision-Language Models for Enhanced Robotic Manipulation
Sheng Wang
VLM
63
2
0
25 Mar 2025
Aligning Vision to Language: Text-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning
Junming Liu
Siyuan Meng
Yanting Gao
Song Mao
Pinlong Cai
Guohang Yan
Yirong Chen
Zilin Bian
Botian Shi
Ding Wang
41
1
0
17 Mar 2025
Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation
Qingxuan Jia
Guoqin Tang
Zeyuan Huang
Zixuan Hao
Ning Ji
Shihang
Gang Chen
29
0
0
07 Mar 2025
SHAPE: Self-Improved Visual Preference Alignment by Iteratively Generating Holistic Winner
Kejia Chen
Jiawen Zhang
Jiacong Hu
Jiazhen Yang
Jian Lou
Zunlei Feng
Mingli Song
53
0
0
06 Mar 2025
Generative Artificial Intelligence in Robotic Manipulation: A Survey
Kun Zhang
Peng Yun
Jun Cen
Junhao Cai
DiDi Zhu
...
Qifeng Chen
Jia Pan
Wei K. Zhang
Bo Yang
Hua Chen
55
1
0
05 Mar 2025
A Token-level Text Image Foundation Model for Document Understanding
Tongkun Guan
Zining Wang
Pei Fu
Zhengtao Guo
Wei-Ming Shen
...
Chen Duan
Hao Sun
Qianyi Jiang
Junfeng Luo
Xiaokang Yang
VLM
43
0
0
04 Mar 2025
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
Yunhai Feng
Jiaming Han
Z. Yang
Xiangyu Yue
Sergey Levine
Jianlan Luo
LM&Ro
37
1
0
23 Feb 2025
A Comprehensive Survey on Composed Image Retrieval
Xuemeng Song
Haoqiang Lin
Haokun Wen
Bohan Hou
Mingzhu Xu
Liqiang Nie
36
1
0
19 Feb 2025
Unhackable Temporal Rewarding for Scalable Video MLLMs
En Yu
Kangheng Lin
Liang Zhao
Yana Wei
Zining Zhu
...
Jianjian Sun
Zheng Ge
X. Zhang
Jingyu Wang
Wenbing Tao
52
4
0
17 Feb 2025
Scalable, Training-Free Visual Language Robotics: A Modular Multi-Model Framework for Consumer-Grade GPUs
Marie Samson
Bastien Muraccioli
Fumio Kanehiro
75
1
0
03 Feb 2025
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Pan Zhang
Xiaoyi Dong
Yuhang Cao
Yuhang Zang
Rui Qian
...
X. Zhang
K. Chen
Yu Qiao
D. Lin
Jiaqi Wang
KELM
81
12
0
12 Dec 2024
Neptune: The Long Orbit to Benchmarking Long Video Understanding
Arsha Nagrani
Mingda Zhang
Ramin Mehran
Rachel Hornung
N. B. Gundavarapu
...
Boqing Gong
Cordelia Schmid
Mikhail Sirotenko
Yukun Zhu
Tobias Weyand
96
4
0
12 Dec 2024
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
Qixiu Li
Yaobo Liang
Zeyu Wang
Lin Luo
Xi Chen
...
Jianmin Bao
Dong Chen
Yuanchun Shi
Jiaolong Yang
B. Guo
LM&Ro
71
20
0
29 Nov 2024
Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads
Siqi Kou
Jiachun Jin
Chang Liu
Ye Ma
Jian Jia
Quan Chen
Peng Jiang
Zhijie Deng
DiffM
VGen
VLM
105
5
0
28 Nov 2024
DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving
Xianda Guo
Ruijun Zhang
Yiqun Duan
Yuhang He
Chenming Zhang
Shuai Liu
Long Chen
LRM
61
11
0
20 Nov 2024
Heuristic-Free Multi-Teacher Learning
Huy Thong Nguyen
En-Hung Chu
Lenord Melvix
Jazon Jiao
Chunglin Wen
Benjamin Louie
62
0
0
19 Nov 2024
Safe Planner: Empowering Safety Awareness in Large Pre-Trained Models for Robot Task Planning
Siyuan Li
Zhe Ma
Feifan Liu
Jiani Lu
Qinqin Xiao
K. Sun
Lingfei Cui
Xirui Yang
P. Liu
Xun Wang
27
0
0
11 Nov 2024
MissionGPT: Mission Planner for Mobile Robot based on Robotics Transformer Model
Vladimir Berman
Artem Bazhenov
Dzmitry Tsetserukou
22
2
0
07 Nov 2024
EMMA: End-to-End Multimodal Model for Autonomous Driving
Jyh-Jing Hwang
Runsheng Xu
Hubert Lin
Wei-Chih Hung
Jingwei Ji
...
Benjamin Sapp
Yin Zhou
James Guo
Dragomir Anguelov
Mingxing Tan
VLM
LM&Ro
36
25
0
30 Oct 2024
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Long Xing
Qidong Huang
Xiaoyi Dong
Jiajie Lu
Pan Zhang
...
Yuhang Cao
Conghui He
Jiaqi Wang
Feng Wu
Dahua Lin
VLM
38
25
0
22 Oct 2024
Towards Optimal Adapter Placement for Efficient Transfer Learning
Aleksandra I. Nowak
Otniel-Bogdan Mercea
Anurag Arnab
Jonas Pfeiffer
Yann N. Dauphin
Utku Evci
13
0
0
21 Oct 2024
Skipping Computations in Multimodal LLMs
Mustafa Shukor
Matthieu Cord
18
2
0
12 Oct 2024
ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models
Sombit Dey
Jan-Nico Zaech
Nikolay Nikolov
Luc Van Gool
Danda Pani Paudel
MoMe
VLM
45
4
0
23 Sep 2024
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Shengsheng Qian
Zuyi Zhou
Dizhan Xue
Bing Wang
Changsheng Xu
LRM
34
1
0
19 Sep 2024
Sparks of Artificial General Intelligence(AGI) in Semiconductor Material Science: Early Explorations into the Next Frontier of Generative AI-Assisted Electron Micrograph Analysis
Sakhinana Sagar Srinivas
Geethan Sannidhi
Sreeja Gangasani
Chidaksh Ravuru
Venkataramana Runkana
22
0
0
17 Sep 2024
In-Context Imitation Learning via Next-Token Prediction
Letian Fu
Huang Huang
Gaurav Datta
Lawrence Yunliang Chen
William Chung-Ho Panitch
Fangchen Liu
Hui Li
Ken Goldberg
LM&Ro
29
12
0
28 Aug 2024
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi
Fuxiao Liu
Shihao Wang
Shijia Liao
Subhashree Radhakrishnan
...
Andrew Tao
Zhiding Yu
Guilin Liu
MLLM
18
53
0
28 Aug 2024
Building and better understanding vision-language models: insights and future directions
Hugo Laurençon
Andrés Marafioti
Victor Sanh
Léo Tronchon
VLM
29
60
0
22 Aug 2024
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Le Xue
Manli Shu
Anas Awadalla
Jun Wang
An Yan
...
Zeyuan Chen
Silvio Savarese
Juan Carlos Niebles
Caiming Xiong
Ran Xu
VLM
41
91
0
16 Aug 2024
VideoQA in the Era of LLMs: An Empirical Study
Junbin Xiao
Nanxin Huang
Hangyu Qin
Dongyang Li
Yicong Li
...
Zhulin Tao
Jianxing Yu
Liang Lin
Tat-Seng Chua
Angela Yao
21
9
0
08 Aug 2024
ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling
William Y. Zhu
Keren Ye
Junjie Ke
Jiahui Yu
Leonidas J. Guibas
P. Milanfar
Feng Yang
35
0
0
07 Aug 2024
VL-TGS: Trajectory Generation and Selection using Vision Language Models in Mapless Outdoor Environments
Daeun Song
Jing Liang
Xuesu Xiao
Dinesh Manocha
44
4
0
05 Aug 2024
$VILA^2$: VILA Augmented VILA
Yunhao Fang
Ligeng Zhu
Yao Lu
Yan Wang
Pavlo Molchanov
Jang Hyun Cho
Marco Pavone
Song Han
Hongxu Yin
VLM
29
7
0
24 Jul 2024
Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
Ziyuan Huang
Kaixiang Ji
Biao Gong
Zhiwu Qing
Qinglong Zhang
Kecheng Zheng
Jian Wang
Jingdong Chen
Ming Yang
LRM
29
0
0
22 Jul 2024
On Pre-training of Multimodal Language Models Customized for Chart Understanding
Wan-Cyuan Fan
Yen-Chun Chen
Mengchen Liu
Lu Yuan
Leonid Sigal
36
4
0
19 Jul 2024
Foundation Models for Autonomous Robots in Unstructured Environments
Hossein Naderi
Alireza Shojaei
Lifu Huang
LM&Ro
34
0
0
19 Jul 2024
Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist
Zihao Zhou
Shudong Liu
Maizhen Ning
Wei Liu
Jindong Wang
Derek F. Wong
Xiaowei Huang
Qiufeng Wang
Kaizhu Huang
ELM
LRM
52
23
0
11 Jul 2024
Extracting Training Data from Document-Based VQA Models
Francesco Pinto
N. Rauschmayr
F. Tramèr
Philip H. S. Torr
Federico Tombari
21
3
0
11 Jul 2024
Multi-modal Transfer Learning between Biological Foundation Models
Juan Jose Garau-Luis
Patrick Bordes
Liam Gonzalez
Masa Roller
Bernardo P. de Almeida
...
Stefan Laurent
Jan Grzegorzewski
Maren Lang
Thomas Pierrot
Guillaume Richard
AI4CE
25
1
0
20 Jun 2024
BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval
Imanol Miranda
Ander Salaberria
Eneko Agirre
Gorka Azkune
CoGe
28
0
0
14 Jun 2024
MLLM-SR: Conversational Symbolic Regression base Multi-Modal Large Language Models
Yanjie Li
Weijun Li
Lina Yu
Min Wu
Jingyi Liu
Wenqiang Li
Shu Wei
Yusong Deng
OffRL
21
3
0
08 Jun 2024
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Shengqiong Wu
Hao Fei
Xiangtai Li
Jiayi Ji
Hanwang Zhang
Tat-Seng Chua
Shuicheng Yan
MLLM
55
25
0
07 Jun 2024
StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond
Pengyuan Lyu
Yulin Li
Hao Zhou
Weihong Ma
Xingyu Wan
...
Liang Wu
Chengquan Zhang
Kun Yao
Errui Ding
Jingdong Wang
33
7
0
31 May 2024
X-VILA: Cross-Modality Alignment for Large Language Model
Hanrong Ye
De-An Huang
Yao Lu
Zhiding Yu
Wei Ping
...
Jan Kautz
Song Han
Dan Xu
Pavlo Molchanov
Hongxu Yin
MLLM
VLM
37
29
0
29 May 2024
The Evolution of Multimodal Model Architectures
S. Wadekar
Abhishek Chaurasia
Aman Chadha
Eugenio Culurciello
41
13
0
28 May 2024
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
Mustafa Shukor
Matthieu Cord
56
5
0
26 May 2024