Papers citing "PaLI-X: On Scaling up a Multilingual Vision and Language Model"

50 / 101 papers shown

Sigma: The Key for Vision-Language-Action Models toward Telepathic Alignment

Libo Wang

114

30 Nov 2025

Co-Training Vision Language Models for Remote Sensing Multi-task Learning

...

179

26 Nov 2025

CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding

215

25 Nov 2025

ViPRA: Video Prediction for Robot Actions

231

11 Nov 2025

The Impact of Image Resolution on Biomedical Multimodal Large Language Models

21 Oct 2025

When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs

156

17 Oct 2025

Prompt-Guided Spatial Understanding with RGB-D Transformers for Fine-Grained Object Relation Reasoning

122

13 Oct 2025

Goal-oriented Backdoor Attack against Vision-Language-Action Models via Physical Objects

134

10 Oct 2025

HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy

175

01 Oct 2025

POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency

144

01 Oct 2025

Multilingual Vision-Language Models, A Survey

Andrei-Alexandru Manea

Jindřich Libovický

VLM

143

26 Sep 2025

Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance

152

19 Sep 2025

Self-Improving Embodied Foundation Models

Seyed Kamyar Seyed Ghasemipour

148

18 Sep 2025

Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance

221

02 Sep 2025

OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning

201

01 Sep 2025

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

247

18 Aug 2025

FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation

141

04 Aug 2025

GR-3 Technical Report

...

320

21 Jul 2025

CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling

...

200

24 Jun 2025

Adapting Vision-Language Models for Evaluating World Models

189

22 Jun 2025

Vision Generalist Model: A Survey

International Journal of Computer Vision (IJCV), 2025

...

293

11 Jun 2025

Sensory-Motor Control with Large Language Models via Iterative Policy Refinement

J. Carvalho

S. Nolfi

LM&Ro

362

05 Jun 2025

SEM: Enhancing Spatial Understanding for Robust Robot Manipulation

302

22 May 2025

Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs

474

21 May 2025

Behind Maya: Building a Multilingual Vision Language Model

Nahid Alam

Karthik Reddy Kanjula

Surya Guthikonda

Timothy Chung

Bala Krishna S Vegesna

...

301

13 May 2025

A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation

...

620

17 Apr 2025

Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision

Information Fusion (Inf. Fusion), 2025

...

439

03 Apr 2025

RoboFlamingo-Plus: Fusion of Depth and RGB Perception with Vision-Language Models for Enhanced Robotic Manipulation

International Conference on Real-time Computing and Robotics (ICRCR), 2025

Sheng Wang

VLM

337

25 Mar 2025

Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning

366

17 Mar 2025

Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation

219

07 Mar 2025

Generative Artificial Intelligence in Robotic Manipulation: A Survey

...

661

05 Mar 2025

A Token-level Text Image Foundation Model for Document Understanding

...

604

04 Mar 2025

Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation

371

23 Feb 2025

A Comprehensive Survey on Composed Image Retrieval

479

19 Feb 2025

Unhackable Temporal Rewarding for Scalable Video MLLMs

...

286

17 Feb 2025

Scalable, Training-Free Visual Language Robotics: A Modular Multi-Model Framework for Consumer-Grade GPUs

IEEE/SICE International Symposium on System Integration (SII), 2025

Marie Samson

Bastien Muraccioli

Fumio Kanehiro

517

03 Feb 2025

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

...

371

12 Dec 2024

Neptune: The Long Orbit to Benchmarking Long Video Understanding

...

446

12 Dec 2024

DocVLM: Make Your VLM an Efficient Reader

Computer Vision and Pattern Recognition (CVPR), 2024

650

11 Dec 2024

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

...

354

187

29 Nov 2024

Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads

607

28 Nov 2024

Heuristic-Free Multi-Teacher Learning

360

19 Nov 2024

EMMA: End-to-End Multimodal Model for Autonomous Driving

...

433

116

30 Oct 2024

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

...

Yuhang Cao

Jiaqi Wang

331

133

22 Oct 2024

ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models

IEEE International Conference on Robotics and Automation (ICRA), 2024

470

23 Sep 2024

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Subhashree Radhakrishnan

...

402

115

28 Aug 2024

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

...

526

141

16 Aug 2024

VideoQA in the Era of LLMs: An Empirical Study

International Journal of Computer Vision (IJCV), 2024

...

350

08 Aug 2024

VL-TGS: Trajectory Generation and Selection using Vision Language Models in Mapless Outdoor Environments

IEEE Robotics and Automation Letters (RA-L), 2024

Daeun Song

Jing Liang

Xuesu Xiao

Dinesh Manocha

572

05 Aug 2024

On Pre-training of Multimodal Language Models Customized for Chart Understanding

360

19 Jul 2024