ResearchTrend.AI

PaLI-X: On Scaling up a Multilingual Vision and Language Model (arXiv:2305.18565)

29 May 2023
Xi Chen
Josip Djolonga
Piotr Padlewski
Basil Mustafa
Soravit Changpinyo
Jialin Wu
Carlos Riquelme Ruiz
Sebastian Goodman
Tianlin Li
Yi Tay
Siamak Shakeri
Mostafa Dehghani
Daniel M. Salz
Mario Lucic
Michael Tschannen
Arsha Nagrani
Hexiang Hu
Mandar Joshi
Bo Pang
Ceslee Montgomery
Paulina Pietrzyk
Marvin Ritter
A. Piergiovanni
Matthias Minderer
Filip Pavetić
Austin Waters
Gang Li
Ibrahim Alabdulmohsin
Lucas Beyer
J. Amelot
Kenton Lee
Andreas Steiner
Yang Li
Daniel Keysers
Anurag Arnab
Yuanzhong Xu
Keran Rong
Alexander Kolesnikov
Mojtaba Seyedhosseini
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut

Papers citing "PaLI-X: On Scaling up a Multilingual Vision and Language Model"

Showing 50 of 101 citing papers.
BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval (NeurIPS 2024)
Imanol Miranda
Ander Salaberria
Eneko Agirre
Gorka Azkune
14 Jun 2024
ChatSR: Multimodal Large Language Models for Scientific Formula Discovery
Yanjie Li
Weijun Li
Lina Yu
Min Wu
Jingyi Liu
Wenqiang Li
Shu Wei
Yusong Deng
08 Jun 2024
Towards Semantic Equivalence of Tokenization in Multimodal LLM (ICLR 2024)
Shengqiong Wu
Hao Fei
Xiangtai Li
Jiayi Ji
Hanwang Zhang
Tat-Seng Chua
Shuicheng Yan
07 Jun 2024
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
26 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
23 May 2024
What matters when building vision-language models? (NeurIPS 2024)
Hugo Laurençon
Léo Tronchon
Matthieu Cord
Victor Sanh
03 May 2024
What Foundation Models can Bring for Robot Learning in Manipulation: A Survey
Dingzhe Li
Yixiang Jin
A. Yong
Yong A
Hongze Yu
...
Huaping Liu
Gang Hua
F. Sun
Jianwei Zhang
Bin Fang
28 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models (ECCV 2024)
Oğuzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
10 Apr 2024
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
Juhong Min
Shyamal Buch
Arsha Nagrani
Minsu Cho
Cordelia Schmid
09 Apr 2024
IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations
Deqing Fu
Ghazal Khalighinejad
Ollie Liu
Bhuwan Dhingra
Dani Yogatama
Robin Jia
Willie Neiswanger
01 Apr 2024
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
Kai Zhang
Yi Luan
Hexiang Hu
Kenton Lee
Siyuan Qiao
Wenhu Chen
Yu-Chuan Su
Ming-Wei Chang
28 Mar 2024
MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control
Enshen Zhou
Yiran Qin
Zhen-fei Yin
Yuzhou Huang
Ruimao Zhang
Lu Sheng
Yu Qiao
Jing Shao
18 Mar 2024
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan
Yousong Zhu
Hongyin Zhao
Fan Yang
Jinqiao Wang
14 Mar 2024
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
David Wan
Jaemin Cho
Elias Stengel-Eskin
Mohit Bansal
04 Mar 2024
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models
Lei Li
Yuqi Wang
Runxin Xu
Peiyi Wang
Xiachong Feng
Lingpeng Kong
Qi Liu
01 Mar 2024
Learning by Watching: A Review of Video-based Learning Approaches for Robot Manipulation (IEEE Access 2024)
Chrisantus Eze
Christopher Crick
11 Feb 2024
InkSight: Offline-to-Online Handwriting Conversion by Teaching Vision-Language Models to Read and Write
B. Mitrevski
Arina Rak
Julian Schnitzler
Chengkun Li
Andrii Maksai
Jesse Berent
C. Musat
08 Feb 2024
Scaling Up LLM Reviews for Google Ads Content Moderation
Wei Qiao
Tushar Dogra
Otilia Stretcu
Yu-Han Lyu
Tiantian Fang
...
Chih-Chun Chia
Ariel Fuxman
Fangzhou Wang
Ranjay Krishna
Mehmet Tek
07 Feb 2024
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Gilles Baechler
Srinivas Sunkara
Maria Wang
Fedir Zubach
Hassan Mansoor
Vincent Etter
Victor Carbune
Jason Lin
Jindong Chen
Abhanshu Sharma
07 Feb 2024
Time-, Memory- and Parameter-Efficient Visual Adaptation (CVPR 2024)
Otniel-Bogdan Mercea
Alexey Gritsenko
Cordelia Schmid
Anurag Arnab
05 Feb 2024
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering
Ziyu Ma
Shutao Li
Bin Sun
Jianfei Cai
Zuxiang Long
Fuyan Ma
04 Feb 2024
VIALM: A Survey and Benchmark of Visually Impaired Assistance with Large Models
Yi Zhao
Yilin Zhang
Rong Xiang
Jing Li
Hillming Li
29 Jan 2024
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions (AAAI 2024)
Ryota Tanaka
Taichi Iki
Kyosuke Nishida
Kuniko Saito
Jun Suzuki
24 Jan 2024
CLIP feature-based randomized control using images and text for multiple tasks and robots
Kazuki Shibata
Hideki Deguchi
Shun Taguchi
18 Jan 2024
Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning
Yiqi Wang
Wentao Chen
Xiaotian Han
Xudong Lin
Haiteng Zhao
Yongfei Liu
Bohan Zhai
Jianbo Yuan
Quanzeng You
Hongxia Yang
10 Jan 2024
Language-Conditioned Robotic Manipulation with Fast and Slow Thinking (ICRA 2024)
Minjie Zhu
Yichen Zhu
Jinming Li
Junjie Wen
Zhiyuan Xu
...
Yaxin Peng
Chaomin Shen
Dong Liu
Feifei Feng
Jian Tang
08 Jan 2024
GPT-4V(ision) is a Generalist Web Agent, if Grounded (ICML 2024)
Boyuan Zheng
Boyu Gou
Jihyung Kil
Huan Sun
Yu-Chuan Su
03 Jan 2024
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects
Junyu Lu
Ruyi Gan
Di Zhang
Xiaojun Wu
Ziwei Wu
Renliang Sun
Jiaxing Zhang
Pingjian Zhang
Yan Song
08 Dec 2023
Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future
Guoying Gu
Yang Li
Huijie Wang
Jia Zeng
Huilin Xu
...
Kai Yan
Beipeng Mu
Zhihui Peng
Shaoqing Ren
Yu Qiao
06 Dec 2023
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment (ECCV 2023)
Brian Gordon
Yonatan Bitton
Yonatan Shafir
Roopal Garg
Xi Chen
Dani Lischinski
Daniel Cohen-Or
Idan Szpektor
05 Dec 2023
SARA-RT: Scaling up Robotics Transformers with Self-Adaptive Robust Attention (ICRA 2023)
Isabel Leal
Krzysztof Choromanski
Deepali Jain
Kumar Avinava Dubey
Jake Varley
...
Q. Vuong
Tamás Sarlós
Kenneth Oslund
Karol Hausman
Kanishka Rao
04 Dec 2023
Leveraging VLM-Based Pipelines to Annotate 3D Objects (ICML 2023)
Rishabh Kabra
Loic Matthey
Alexander Lerchner
Niloy J. Mitra
29 Nov 2023
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (CVPR 2023)
Xiang Yue
Yuansheng Ni
Kai Zhang
Tianyu Zheng
Ruoqi Liu
...
Yibo Liu
Wenhao Huang
Huan Sun
Yu-Chuan Su
Wenhu Chen
27 Nov 2023
EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension (CVPR 2023)
Jiaxuan Li
D. Vo
Akihiro Sugimoto
Hideki Nakayama
27 Nov 2023
Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
Zhuosheng Zhang
Yao Yao
Aston Zhang
Xiangru Tang
Xinbei Ma
...
Yiming Wang
Mark B. Gerstein
Rui Wang
Gongshen Liu
Hai Zhao
20 Nov 2023
Large Language Models for Robotics: A Survey
Fanlong Zeng
Wensheng Gan
Zezheng Huai
Lichao Sun
Hechang Chen
Yongheng Wang
Ning Liu
Philip S. Yu
13 Nov 2023
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (CVPR 2023)
Bin Xiao
Haiping Wu
Weijian Xu
Xiyang Dai
Houdong Hu
Yumao Lu
Michael Zeng
Ce Liu
Lu Yuan
10 Nov 2023
OtterHD: A High-Resolution Multi-modality Model
Yue Liu
Peiyuan Zhang
Jingkang Yang
Yuanhan Zhang
Fanyi Pu
Ziwei Liu
07 Nov 2023
CogVLM: Visual Expert for Pretrained Language Models (NeurIPS 2023)
Weihan Wang
Qingsong Lv
Wenmeng Yu
Wenyi Hong
Ji Qi
...
Bin Xu
Juanzi Li
Yuxiao Dong
Ming Ding
Jie Tang
06 Nov 2023
De-Diffusion Makes Text a Strong Cross-Modal Interface (CVPR 2023)
Chen Wei
Chenxi Liu
Siyuan Qiao
Zhishuai Zhang
Alan Yuille
Jiahui Yu
01 Nov 2023
Advances in Embodied Navigation Using Large Language Models: A Survey
Jinzhou Lin
Han Gao
Xuxiang Feng
Rongtao Xu
Changwei Wang
Man Zhang
Li Guo
Shibiao Xu
01 Nov 2023
DOMINO: A Dual-System for Multi-step Visual Language Reasoning
Peifang Wang
O. Yu. Golovneva
Armen Aghajanyan
Xiang Ren
Muhao Chen
Asli Celikyilmaz
Maryam Fazel-Zarandi
04 Oct 2023
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning (ICLR 2023)
Mustafa Shukor
Alexandre Ramé
Corentin Dancette
Matthieu Cord
01 Oct 2023
CausalLM is not optimal for in-context learning (ICLR 2023)
Nan Ding
Tomer Levinboim
Jialin Wu
Sebastian Goodman
Radu Soricut
14 Aug 2023
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (CoRL 2023)
Anthony Brohan
Noah Brown
Justice Carbajal
Yevgen Chebotar
Xi Chen
...
Ted Xiao
Peng Xu
Sichun Xu
Tianhe Yu
Brianna Zitkovich
28 Jul 2023
Emu: Generative Pretraining in Multimodality (ICLR 2023)
Quan-Sen Sun
Qiying Yu
Yufeng Cui
Fan Zhang
Xiaosong Zhang
Yueze Wang
Hongcheng Gao
Jingjing Liu
Tiejun Huang
Xinlong Wang
11 Jul 2023
Dense Video Object Captioning from Disjoint Supervision (ICLR 2023)
Xingyi Zhou
Anurag Arnab
Chen Sun
Cordelia Schmid
20 Jun 2023
Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining (EMNLP 2023)
Emanuele Bugliarello
Aida Nematzadeh
Lisa Anne Hendricks
23 May 2023
Otter: A Multi-Modal Model with In-Context Instruction Tuning (TPAMI 2023)
Yue Liu
Yuanhan Zhang
Liangyu Chen
Jinghao Wang
Fanyi Pu
Joshua Adrian Cahyono
Jingkang Yang
Yu Qiao
05 May 2023
Subject-driven Text-to-Image Generation via Apprenticeship Learning (NeurIPS 2023)
Wenhu Chen
Hexiang Hu
Yandong Li
Nataniel Rui
Xuhui Jia
Ming-Wei Chang
William W. Cohen
01 Apr 2023