Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

Neural Information Processing Systems (NeurIPS), 2023

12 July 2023

Ibrahim Alabdulmohsin

Papers citing "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution"

Kimi-VL Technical Report

...

961

139

10 Apr 2025

Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models

302

10 Apr 2025

SapiensID: Foundation for Human Recognition

Computer Vision and Pattern Recognition (CVPR), 2025

290

07 Apr 2025

Charm: The Missing Piece in ViT fine-tuning for Image Aesthetic Assessment

Computer Vision and Pattern Recognition (CVPR), 2025

299

03 Apr 2025

UniViTAR: Unified Vision Transformer with Native Resolution

475

02 Apr 2025

Navi-plus: Managing Ambiguous GUI Navigation Tasks with Follow-up Questions

384

31 Mar 2025

Synthetic Video Enhances Physical Fidelity in Video Synthesis

325

26 Mar 2025

CoMP: Continual Multimodal Pre-training for Vision Foundation Models

482

24 Mar 2025

Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers

304

14 Mar 2025

Long Context Tuning for Video Generation

392

13 Mar 2025

FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute

Computer Vision and Pattern Recognition (CVPR), 2025

393

27 Feb 2025

M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance

...

578

26 Feb 2025

MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs

International Conference on Learning Representations (ICLR), 2025

287

24 Feb 2025

FeatSharp: Your Vision Model Features, Sharper

401

22 Feb 2025

PFDiff: Training-Free Acceleration of Diffusion Models Combining Past and Future Scores

International Conference on Learning Representations (ICLR), 2024

306

21 Feb 2025

Thicker and Quicker: A Jumbo Token for Fast Plain Vision Transformers

479

20 Feb 2025

CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation

403

12 Feb 2025

Prion-ViT: Prions-Inspired Vision Transformers for Temperature prediction with Specklegrams

307

28 Jan 2025

Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum

Neural Information Processing Systems (NeurIPS), 2024

Hadi Pouransari

Chun-Liang Li

Jen-Hao Rick Chang

Pavan Kumar Anasosalu Vasu

Cem Koc

Vaishaal Shankar

Oncel Tuzel

336

08 Jan 2025

ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning

407

08 Jan 2025

Efficient Architectures for High Resolution Vision-Language Models

International Conference on Computational Linguistics (COLING), 2025

Miguel Carvalho

Bruno Martins

MLLM VLM

199

05 Jan 2025

Aria-UI: Visual Grounding for GUI Instructions

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

492

20 Dec 2024

EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing

284

13 Dec 2024

LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity

Computer Vision and Pattern Recognition (CVPR), 2024

...

423

13 Dec 2024

Open-Sora Plan: Open-Source Large Video Generation Model

...

427

191

28 Nov 2024

Diffusion Sampling Correction via Approximately 10 Parameters

444

10 Nov 2024

Adaptive Aspect Ratios with Patch-Mixup-ViT-based Vehicle ReID

199

09 Nov 2024

Don't Look Twice: Faster Video Transformers with Run-Length Tokenization

Neural Information Processing Systems (NeurIPS), 2024

248

07 Nov 2024

Context-Aware Token Selection and Packing for Enhanced Vision Transformer

Tianyi Zhang

B. Li

Jae-sun Seo

Yu Cao

175

31 Oct 2024

FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model

ZiDong Wang

Wanli Ouyang

285

17 Oct 2024

Locality Alignment Improves Vision-Language Models

International Conference on Learning Representations (ICLR), 2024

589

14 Oct 2024

NARAIM: Native Aspect Ratio Autoregressive Image Models

Daniel Gallo Fernández

Robert van der Klis

Rǎzvan-Andrei Matişan

13 Oct 2024

Devendra Singh Chaplot

...

272

112

09 Oct 2024

Aria: An Open Multimodal Native Mixture-of-Experts Model

Dongxu Li

Yudong Liu

Haoning Wu

Yue Wang

Zhiqi Shen

...

Lihuan Zhang

Hanshu Yan

Guoyin Wang

Bei Chen

Junnan Li

MoE

492

114

08 Oct 2024

Pyramidal Flow Matching for Efficient Video Generative Modeling

International Conference on Learning Representations (ICLR), 2024

Kun Xu

...

Yang Song

513

200

08 Oct 2024

HATFormer: Historic Handwritten Arabic Text Recognition with Transformers

648

03 Oct 2024

FlashMask: Efficient and Rich Mask Extension of FlashAttention

International Conference on Learning Representations (ICLR), 2024

763

02 Oct 2024

MIO: A Foundation Model on Multimodal Tokens

...

458

26 Sep 2024

Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

International Conference on Learning Representations (ICLR), 2024

Zuyan Liu

Yuhao Dong

Ziwei Liu

Winston Hu

Jiwen Lu

Yongming Rao

ObjD

605

131

19 Sep 2024

Agglomerative Token Clustering

European Conference on Computer Vision (ECCV), 2024

Joakim Bruslund Haurum

Sergio Escalera

Graham W. Taylor

T. Moeslund

287

18 Sep 2024

Building and better understanding vision-language models: insights and future directions

Hugo Laurençon

317

132

22 Aug 2024

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

International Conference on Learning Representations (ICLR), 2024

Zhuoyi Yang

Wendi Zheng

...

Xiaotao Gu

Yuxiao Dong

Jie Tang

DiffM VGen

859

1,293

12 Aug 2024

Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning

252

17 Jul 2024

ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions

213

11 Jul 2024

The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

Bolin Ding

Yaliang Li

Shuiguang Deng

347

11 Jul 2024

Study on Aspect Ratio Variability toward Robustness of Vision Transformer-based Vehicle Re-identification

163

10 Jul 2024

MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions

Xuan Ju

Xintao Wang

288

103

08 Jul 2024

M5: A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks

Florian Schneider

Sunayana Sitaram

VLM

253

04 Jul 2024

Learning to Be a Transformer to Pinpoint Anomalies

Alex Costanzino

Pierluigi Zama Ramirez

Giuseppe Lisanti

Luigi Di Stefano

282

04 Jul 2024

Data curation via joint example selection further accelerates multimodal learning

301

25 Jun 2024