Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

24 June 2024

Sanghyun Woo

Papers citing "Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs"

50 / 413 papers shown

When Less is Enough: Adaptive Token Reduction for Efficient Image Representation

Eduard Allakhverdov

Elizaveta Goncharova

Andrey Kuznetsov

212

20 Mar 2025

POSTA: A Go-to Framework for Customized Artistic Poster Generation

Computer Vision and Pattern Recognition (CVPR), 2025

300

19 Mar 2025

Visual Position Prompt for MLLM based Visual Grounding

529

19 Mar 2025

Where do Large Vision-Language Models Look at when Answering Questions?

284

18 Mar 2025

Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning

368

17 Mar 2025

Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

296

17 Mar 2025

MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs

...

290

17 Mar 2025

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

393

206

17 Mar 2025

BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries

230

16 Mar 2025

TikZero: Zero-Shot Text-Guided Graphics Program Synthesis

Simone Paolo Ponzetto

623

14 Mar 2025

Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection

328

14 Mar 2025

Towards Understanding Graphical Perception in Large Multimodal Models

316

13 Mar 2025

4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models

Computer Vision and Pattern Recognition (CVPR), 2025

425

13 Mar 2025

VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search

492

13 Mar 2025

Generative Frame Sampler for Long Video Understanding

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

282

12 Mar 2025

Revisiting semi-supervised learning in the era of foundation models

567

12 Mar 2025

Referring to Any Person

932

11 Mar 2025

Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis

462

11 Mar 2025

SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories

Computer Vision and Pattern Recognition (CVPR), 2025

288

11 Mar 2025

Should VLMs be Pre-trained with Image Data?

International Conference on Learning Representations (ICLR), 2025

...

240

10 Mar 2025

Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning

International Conference on Learning Representations (ICLR), 2025

261

10 Mar 2025

When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning

484

10 Mar 2025

Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning

Computer Vision and Pattern Recognition (CVPR), 2025

1.0K

10 Mar 2025

Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding

Computer Vision and Pattern Recognition (CVPR), 2025

300

08 Mar 2025

Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices

Computer Vision and Pattern Recognition (CVPR), 2025

235

08 Mar 2025

GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices

...

403

08 Mar 2025

R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

395

127

07 Mar 2025

SHAPE: Self-Improved Visual Preference Alignment by Iteratively Generating Holistic Winner

329

06 Mar 2025

See What You Are Told: Visual Attention Sink in Large Multimodal Models

International Conference on Learning Representations (ICLR), 2025

360

05 Mar 2025

Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs

354

04 Mar 2025

Words or Vision: Do Vision-Language Models Have Blind Faith in Text?

Computer Vision and Pattern Recognition (CVPR), 2025

338

04 Mar 2025

A Token-level Text Image Foundation Model for Document Understanding

...

604

04 Mar 2025

Advancing vision-language models in front-end development via data synthesis

209

03 Mar 2025

Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas

447

03 Mar 2025

Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models

294

02 Mar 2025

RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete

Computer Vision and Pattern Recognition (CVPR), 2025

...

484

28 Feb 2025

Chitranuvad: Adapting Multi-Lingual LLMs for Multimodal Translation

Conference on Machine Translation (WMT), 2025

890

27 Feb 2025

R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

470

27 Feb 2025

Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision

...

581

26 Feb 2025

M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance

...

588

26 Feb 2025

OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

...

380

25 Feb 2025

MMAD: A Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection

International Conference on Learning Representations (ICLR), 2024

409

24 Feb 2025

Chitrarth: Bridging Vision and Language for a Billion People

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025

583

21 Feb 2025

Contrastive Localized Language-Image Pre-Training

350

20 Feb 2025

Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

...

563

20 Feb 2025

Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

401

19 Feb 2025

Megrez-Omni Technical Report

...

235

19 Feb 2025

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

...

445

18 Feb 2025

Magma: A Foundation Model for Multimodal AI Agents

Computer Vision and Pattern Recognition (CVPR), 2025

...

355

18 Feb 2025

PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection

...

517

17 Feb 2025