Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

Neural Information Processing Systems (NeurIPS), 2023

12 July 2023

Ibrahim Alabdulmohsin

Papers citing "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution"

Jina-VLM: Small Multilingual Vision Language Model

336

03 Dec 2025

Spatiotemporal Pyramid Flow Matching for Climate Emulation

Nomin-Erdene Bayarsaikhan

01 Dec 2025

ReasonEdit: Towards Reasoning-Enhanced Image Editing Models

...

237

27 Nov 2025

Re-Key-Free, Risky-Free: Adaptable Model Usage Control

165

24 Nov 2025

Striking the Right Balance between Compute and Copy: Improving LLM Inferencing Under Speculative Decoding

Arun Ramachandran

Ramaswamy Govindarajan

M. Annavaram

Prakash Raghavendra

Hossein Entezari Zarch

Lei Gao

Chaoyi Jiang

148

15 Nov 2025

Application of Graph Based Vision Transformers Architectures for Accurate Temperature Prediction in Fiber Specklegram Sensors

Abhishek Sebastian

141

15 Nov 2025

LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation

...

359

27 Oct 2025

CARES: Context-Aware Resolution Selector for VLMs

120

22 Oct 2025

SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models

Gyubeum Lim

Yemo Koo

Vijay Krishna Madisetti

100

22 Oct 2025

DeepSeek-OCR: Contexts Optical Compression

232

21 Oct 2025

Accelerating Vision Transformers with Adaptive Patch Sizes

116

20 Oct 2025

StretchySnake: Flexible SSM Training Unlocks Action Recognition Across Spatio-Temporal Scales

152

17 Oct 2025

Task-Aware Resolution Optimization for Visual Large Language Models

10 Oct 2025

UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution

272

09 Oct 2025

PatchVSR: Breaking Video Diffusion Resolution Limits with Patch-wise Video Super-Resolution

Computer Vision and Pattern Recognition (CVPR), 2025

287

30 Sep 2025

MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech

159

29 Sep 2025

DentVLM: A Multimodal Vision-Language Model for Comprehensive Dental Diagnosis and Enhanced Clinical Practice

...

122

27 Sep 2025

Multilingual Vision-Language Models, A Survey

Andrei-Alexandru Manea

Jindřich Libovický

VLM

143

26 Sep 2025

Revisiting Data Challenges of Computational Pathology: A Pack-based Multiple Instance Learning Training Framework

191

25 Sep 2025

PMRT: A Training Recipe for Fast, 3D High-Resolution Aerodynamic Prediction

131

21 Sep 2025

Lynx: Towards High-Fidelity Personalized Video Generation

208

19 Sep 2025

Qianfan-VL: Domain-Enhanced Universal Vision-Language Models

...

19 Sep 2025

AToken: A Unified Tokenizer for Vision

236

17 Sep 2025

MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs

317

15 Sep 2025

Reconstruction Alignment Improves Unified Multimodal Models

214

08 Sep 2025

Kwai Keye-VL 1.5 Technical Report

...

325

01 Sep 2025

POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion

...

164

01 Sep 2025

How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding

Zhuoran Yu

Yong Jae Lee

LRM

27 Aug 2025

Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision

213

07 Aug 2025

Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off

Seungyong Lee

Jeong-gi Kwak

DiffM

237

06 Aug 2025

Few-Shot Vision-Language Reasoning for Satellite Imagery via Verifiable Rewards

Aybora Koksal

A. Aydin Alatan

OffRL LRM

169

29 Jul 2025

ZERO: Industry-ready Vision Foundation Model with Multi-modal Prompts

211

06 Jul 2025

OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

1.3K

01 Jul 2025

SeedEdit 3.0: Fast and High-Quality Generative Image Editing

411

05 Jun 2025

Native-Resolution Image Synthesis

308

03 Jun 2025

Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks

264

02 Jun 2025

EffiVLM-BENCH: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Vision-Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

193

31 May 2025

Frame In-N-Out: Unbounded Controllable Image-to-Video Generation

372

27 May 2025

Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

196

23 May 2025

TinyRS-R1: Compact Multimodal Language Model for Remote Sensing

IEEE Geoscience and Remote Sensing Letters (GRSL), 2025

Aybora Koksal

A. Aydin Alatan

LRM

263

17 May 2025

SAMChat: Introducing Chain of Thought Reasoning and GRPO to a Multimodal Small Language Model for Small Scale Remote Sensing

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (IEEE J-STARS), 2025

Aybora Koksal

A. Aydin Alatan

LRM

283

12 May 2025

CM1 - A Dataset for Evaluating Few-Shot Information Extraction with Large Vision Language Models

IEEE International Conference on Document Analysis and Recognition (ICDAR), 2025

280

07 May 2025

Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

...

1.1K

05 May 2025

Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction

...

563

05 May 2025

OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding

251

20 Apr 2025

How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos?

...

484

19 Apr 2025

OVERLORD: Ultimate Scaling of DataLoader for Multi-Source Large Foundation Model Training

...

355

14 Apr 2025

Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

...

571

11 Apr 2025

Kimi-VL Technical Report

...

961

139

10 Apr 2025

Data Metabolism: An Efficient Data Design Schema For Vision Language Model

381

10 Apr 2025