Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

24 June 2024

Sanghyun Woo

ArXiv (abs)PDF HTML HuggingFace (61 upvotes)

Papers citing "Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs"

50 / 413 papers shown

Swiss Army Knife: Synergizing Biases in Knowledge from Vision Foundation Models for Multi-Task LearningInternational Conference on Learning Representations (ICLR), 2024

Yuxiang Lu

Shengcao Cao

Yu-Xiong Wang

481

18 Oct 2024

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

Kun Wang

Hongsheng Li

354

17 Oct 2024

γ-

MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models

279

17 Oct 2024

Can MLLMs Understand the Deep Implication Behind Chinese Images?Annual Meeting of the Association for Computational Linguistics (ACL), 2024

...

157

17 Oct 2024

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

Zhaowei Li

211

16 Oct 2024

Automatically Generating Visual Hallucination Test Cases for Multimodal Large Language Models

158

15 Oct 2024

MEV Capture Through Time-Advantaged Arbitrage

252

14 Oct 2024

Can We Predict Performance of Large Models across Vision-Language Tasks?

437

14 Oct 2024

Locality Alignment Improves Vision-Language ModelsInternational Conference on Learning Representations (ICLR), 2024

592

14 Oct 2024

MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models

Chenliang Xu

235

13 Oct 2024

VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models AlignmentConference on Empirical Methods in Natural Language Processing (EMNLP), 2024

416

12 Oct 2024

On the Evaluation of Generative Robotic Simulations

Yanchao Yang

Yi Ma

Huazhe Xu

VGen

283

10 Oct 2024

Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision

Shengcao Cao

Liang-Yan Gui

Yu Wang

242

10 Oct 2024

Insight Over Sight: Exploring the Vision-Knowledge Conflicts in Multimodal LLMsAnnual Meeting of the Association for Computational Linguistics (ACL), 2024

951

10 Oct 2024

Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate

Qidong Huang

Xiaoyi Dong

Pan Zhang

Yuhang Zang

Yuhang Cao

Jiaqi Wang

Dahua Lin

Weiming Zhang

Nenghai Yu

186

09 Oct 2024

Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

Chenliang Xu

249

08 Oct 2024

Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention CausalityInternational Conference on Learning Representations (ICLR), 2024

Guanyu Zhou

Yibo Yan

Xin Zou

Kun Wang

Aiwei Liu

Xuming Hu

229

07 Oct 2024

AuroraCap: Efficient, Performant Video Detailed Captioning and a New BenchmarkInternational Conference on Learning Representations (ICLR), 2024

Christopher D. Manning

3DV

649

04 Oct 2024

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Zhengfeng Lai

Vasileios Saveris

Chen Chen

Hong-You Chen

Haotian Zhang

...

Wenze Hu

Zhe Gan

Peter Grasch

Meng Cao

Yinfei Yang

VLM

176

03 Oct 2024

Toward a Holistic Evaluation of Robustness in CLIP ModelsIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

349

02 Oct 2024

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Haotian Zhang

Mingfei Gao

...

Zirui Wang

Yinfei Yang

303

30 Sep 2024

CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling

607

28 Sep 2024

Harnessing Frozen Unimodal Encoders for Flexible Multimodal AlignmentComputer Vision and Pattern Recognition (CVPR), 2024

164

28 Sep 2024

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid EmotionsComputer Vision and Pattern Recognition (CVPR), 2024

Kai Chen

Zhili Liu

...

Jun Yao

433

26 Sep 2024

Phantom of Latent for Large Language and Vision Models

Yong Man Ro

276

23 Sep 2024

A Survey on Multimodal Benchmarks: In the Era of Large AI Models

Lin Li

Guikun Chen

Hanrong Shi

Jun Xiao

Long Chen

346

21 Sep 2024

Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary ResolutionInternational Conference on Learning Representations (ICLR), 2024

Zuyan Liu

Yuhao Dong

Ziwei Liu

Winston Hu

Jiwen Lu

Yongming Rao

ObjD

608

131

19 Sep 2024

Large Language Models are Strong Audio-Visual Speech Recognition LearnersIEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

Umberto Cappellazzo

Minsu Kim

Honglie Chen

Pingchuan Ma

Stavros Petridis

Daniele Falavigna

Alessio Brutti

Maja Pantic

392

18 Sep 2024

NVLM: Open Frontier-Class Multimodal LLMs

Wenliang Dai

Zihan Liu

301

114

17 Sep 2024

POINTS: Improving Your Vision-language Model with Affordable Strategies

261

07 Sep 2024

No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning

Manu Gaur

Darshan Singh

Makarand Tapaswi

939

04 Sep 2024

Law of Vision Representation in MLLMs

577

29 Aug 2024

A Survey on Evaluation of Multimodal Large Language Models

Jiaxing Huang

Jingyi Zhang

LM&MA ELM LRM

305

28 Aug 2024

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Subhashree Radhakrishnan

...

403

115

28 Aug 2024

IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal CapabilitiesAAAI Conference on Artificial Intelligence (AAAI), 2024

475

23 Aug 2024

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?International Conference on Learning Representations (ICLR), 2024

Yi-Fan Zhang

Huanyu Zhang

Haochen Tian

Chaoyou Fu

Shuangqing Zhang

...

Qingsong Wen

Zhang Zhang

Liwen Wang

Rong Jin

Tieniu Tan

OffRL

363

134

23 Aug 2024

Show-o: One Single Transformer to Unify Multimodal Understanding and GenerationInternational Conference on Learning Representations (ICLR), 2024

Weihao Wang

Kevin Qinghong Lin

Yuchao Gu

Zhijie Chen

Zhenheng Yang

Mike Zheng Shou

398

439

22 Aug 2024

SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs

332

21 Aug 2024

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

...

526

141

16 Aug 2024

Towards Enhanced Context Awareness with Vision-based Multimodal InterfacesInternational Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI), 2024

Yongquan Hu

Wen Hu

Aaron Quigley

14 Aug 2024

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language ModelsInternational Conference on Learning Representations (ICLR), 2024

Ming Yan

Fei Huang

Jingren Zhou

MLLM VLM

314

230

09 Aug 2024

VITA: Towards Open-Source Interactive Omni Multimodal LLM

...

575

146

09 Aug 2024

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li

Yuanhan Zhang

Dong Guo

Renrui Zhang

Feng Li

Hao Zhang

Kaichen Zhang

Yanwei Li

Ziwei Liu

Chunyuan Li

MLLM SyDa VLM

567

1,767

06 Aug 2024

Diffusion Feedback Helps CLIP See BetterInternational Conference on Learning Representations (ICLR), 2024

Jing Liu

331

29 Jul 2024

VideoGameBunny: Towards vision assistants for video games

Mohammad Reza Taesiri

Cor-Paul Bezemer

VLM MLLM

233

21 Jul 2024

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

...

725

358

16 Jul 2024

The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

Bolin Ding

Yaliang Li

Shuiguang Deng

347

11 Jul 2024

HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models

Runhui Huang

198

11 Jul 2024

HiLight: Technical Report on the Motern AI Video Language Model

126

10 Jul 2024

Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

237

08 Jul 2024