arXiv: 2307.06304

Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

Neural Information Processing Systems (NeurIPS), 2023
12 July 2023
Mostafa Dehghani
Basil Mustafa
Josip Djolonga
Jonathan Heek
Matthias Minderer
Mathilde Caron
Andreas Steiner
J. Puigcerver
Robert Geirhos
Ibrahim Alabdulmohsin
Avital Oliver
Piotr Padlewski
A. Gritsenko
Mario Lučić
N. Houlsby
    ViT
ArXiv (abs) | PDF | HTML | HuggingFace (31 upvotes)

Papers citing "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution"

50 / 120 papers shown
Kimi-VL Technical Report
Kimi Team
Angang Du
B. Yin
Bowei Xing
Bowen Qu
...
Longxiang Zhang
Zhe Chen
Zijia Zhao
Ziwei Chen
Zongyu Lin
MLLM VLM MoE
961
139
0
10 Apr 2025
Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models
Xingguang Ji
Jiakang Wang
Hongzhi Zhang
Jingyuan Zhang
Haonan Zhou
Chenxi Sun
Wenshu Fan
Qi Wang
Fuzheng Zhang
MLLM VLM
302
1
0
10 Apr 2025
SapiensID: Foundation for Human Recognition
Computer Vision and Pattern Recognition (CVPR), 2025
Minchul Kim
Dingqiang Ye
Yiyang Su
Feng Liu
Xiaoming Liu
CVBM VLM
290
8
0
07 Apr 2025
Charm: The Missing Piece in ViT fine-tuning for Image Aesthetic Assessment
Computer Vision and Pattern Recognition (CVPR), 2025
Fatemeh Behrad
Tinne Tuytelaars
Johan Wagemans
ViT
299
3
0
03 Apr 2025
UniViTAR: Unified Vision Transformer with Native Resolution
Limeng Qiao
Yiyang Gan
Bairui Wang
Jie Qin
Shuang Xu
Siqi Yang
Lin Ma
475
2
0
02 Apr 2025
Navi-plus: Managing Ambiguous GUI Navigation Tasks with Follow-up Questions
Ziming Cheng
Zhiyuan Huang
Junting Pan
Zhaohui Hou
Mingjie Zhan
384
4
0
31 Mar 2025
Synthetic Video Enhances Physical Fidelity in Video Synthesis
Qi Zhao
Xingyu Ni
Ziyu Wang
Feng Cheng
Ziyan Yang
Lu Jiang
Bohan Wang
VGen
325
9
0
26 Mar 2025
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Yuxiao Chen
L. Meng
Wujian Peng
Zuxuan Wu
Yu-Gang Jiang
VLM
482
5
0
24 Mar 2025
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
Weiming Ren
Wentao Ma
Huan Yang
Cong Wei
Ge Zhang
Lei Ma
Mamba
304
19
0
14 Mar 2025
Long Context Tuning for Video Generation
Yuwei Guo
Ceyuan Yang
Ziyan Yang
Zhibei Ma
Zhijie Lin
Zhenheng Yang
Dahua Lin
Lu Jiang
DiffM VGen
392
56
0
13 Mar 2025
FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute
Computer Vision and Pattern Recognition (CVPR), 2025
Sotiris Anagnostidis
Gregor Bachmann
Yeongmin Kim
Jonas Kohler
Markos Georgopoulos
A. Sanakoyeu
Yuming Du
Albert Pumarola
Ali K. Thabet
Edgar Schönfeld
393
5
0
27 Feb 2025
M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
Qingpei Guo
Kaiyou Song
Zipeng Feng
Ziping Ma
Qinglong Zhang
...
Yunxiao Sun
Tai-Wei Chang
Jingdong Chen
Ming Yang
Jun Zhou
MLLM VLM
578
12
0
26 Feb 2025
MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
International Conference on Learning Representations (ICLR), 2025
Jiarui Zhang
Mahyar Khayatkhoei
P. Chhikara
Filip Ilievski
LRM
287
78
0
24 Feb 2025
FeatSharp: Your Vision Model Features, Sharper
Mike Ranzinger
Greg Heinrich
Pavlo Molchanov
Jan Kautz
Bryan Catanzaro
Andrew Tao
CLIP VLM
401
3
0
22 Feb 2025
PFDiff: Training-Free Acceleration of Diffusion Models Combining Past and Future Scores
International Conference on Learning Representations (ICLR), 2024
Guangyi Wang
Yuren Cai
Lijiang Li
Wei Peng
Songzhi Su
DiffM
306
0
0
21 Feb 2025
Thicker and Quicker: A Jumbo Token for Fast Plain Vision Transformers
A. Fuller
Yousef Yassin
Daniel G. Kyrollos
Evan Shelhamer
James R. Green
479
1
0
20 Feb 2025
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation
Qinghe Wang
Yawen Luo
Xiaoyu Shi
Xu Jia
Huchuan Lu
Tianfan Xue
Xintao Wang
Pengfei Wan
Di Zhang
Kun Gai
DiffM VGen
403
33
0
12 Feb 2025
Prion-ViT: Prions-Inspired Vision Transformers for Temperature prediction with Specklegrams
Abhishek Sebastian
Pragna R
Sonaa Rajagopal
Muralikrishnan Mani
307
2
0
28 Jan 2025
Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
Neural Information Processing Systems (NeurIPS), 2024
Hadi Pouransari
Chun-Liang Li
Jen-Hao Rick Chang
Pavan Kumar Anasosalu Vasu
Cem Koc
Vaishaal Shankar
Oncel Tuzel
336
23
0
08 Jan 2025
ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
Yuzhou Huang
Ziyang Yuan
Quande Liu
Qiulin Wang
Xintao Wang
Ruimao Zhang
Pengfei Wan
Di Zhang
Kun Gai
VGen DiffM
407
47
0
08 Jan 2025
Efficient Architectures for High Resolution Vision-Language Models
International Conference on Computational Linguistics (COLING), 2025
Miguel Carvalho
Bruno Martins
MLLM VLM
199
1
0
05 Jan 2025
Aria-UI: Visual Grounding for GUI Instructions
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yuhao Yang
Yue Wang
Dongxu Li
Ziyang Luo
Bei Chen
Chenyu Huang
Junnan Li
LM&Ro LLMAG
492
94
0
20 Dec 2024
EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing
Umar Khalid
Kashif Munir
Hasan Iqbal
Nazanin Rahnavard
Jing Hua
Chen Chen
Victor Zhu
284
0
0
13 Dec 2024
LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
Computer Vision and Pattern Recognition (CVPR), 2024
Hongjie Wang
Chih-Yao Ma
Yen-Cheng Liu
Ji Hou
Tao Xu
...
Peizhao Zhang
Tingbo Hou
Peter Vajda
N. Jha
Xiaoliang Dai
LMTD VGen VLM DiffM
423
27
0
13 Dec 2024
Open-Sora Plan: Open-Source Large Video Generation Model
Bin Lin
Yunyang Ge
Xinhua Cheng
Zongjian Li
Bin Zhu
...
Zhang Pan
Xing Zhou
Shaoling Dong
Yonghong Tian
Li-xin Yuan
VLM VGen
427
191
0
28 Nov 2024
Diffusion Sampling Correction via Approximately 10 Parameters
Guangyi Wang
Wei Peng
Lijiang Li
Wenyu Chen
Yuren Cai
Songzhi Su
DiffM
444
0
0
10 Nov 2024
Adaptive Aspect Ratios with Patch-Mixup-ViT-based Vehicle ReID
Mei Qiu
Lauren Christopher
Stanley Y. P. Chien
Lingxi Li
ViT
199
3
0
09 Nov 2024
Don't Look Twice: Faster Video Transformers with Run-Length Tokenization
Neural Information Processing Systems (NeurIPS), 2024
Rohan Choudhury
Guanglei Zhu
Sihan Liu
Koichiro Niinuma
Kishore Venkateshan
László A. Jeni
248
27
0
07 Nov 2024
Context-Aware Token Selection and Packing for Enhanced Vision Transformer
Tianyi Zhang
B. Li
Jae-sun Seo
Yu Cao
175
1
0
31 Oct 2024
FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model
ZiDong Wang
Zeyu Lu
Di Huang
Cai Zhou
Wanli Ouyang
Lei Bai
285
9
0
17 Oct 2024
Locality Alignment Improves Vision-Language Models
International Conference on Learning Representations (ICLR), 2024
Ian Covert
Tony Sun
James Zou
Tatsunori Hashimoto
VLM
589
11
0
14 Oct 2024
NARAIM: Native Aspect Ratio Autoregressive Image Models
Daniel Gallo Fernández
Robert van der Klis
Rǎzvan-Andrei Matişan
Janusz Partyka
E. Gavves
Samuele Papa
Phillip Lippe
58
0
0
13 Oct 2024
Pixtral 12B
Pravesh Agrawal
Szymon Antoniak
Emma Bou Hanna
Baptiste Bout
Devendra Singh Chaplot
...
Joachim Studnia
Sandeep Subramanian
Sagar Vaze
Thomas Wang
Sophia Yang
VLM MLLM
272
112
0
09 Oct 2024
Aria: An Open Multimodal Native Mixture-of-Experts Model
Dongxu Li
Yudong Liu
Haoning Wu
Yue Wang
Zhiqi Shen
...
Lihuan Zhang
Hanshu Yan
Guoyin Wang
Bei Chen
Junnan Li
MoE
492
114
0
08 Oct 2024
Pyramidal Flow Matching for Efficient Video Generative Modeling
International Conference on Learning Representations (ICLR), 2024
Yang Jin
Zhicheng Sun
Ningyuan Li
Kun Xu
K. Xu
...
Nan Zhuang
Quzhe Huang
Yang Song
Yadong Mu
Zhouchen Lin
VGen
513
200
0
08 Oct 2024
HATFormer: Historic Handwritten Arabic Text Recognition with Transformers
Adrian Chan
Anupam Mijar
Mehreen Saeed
Chau-Wai Wong
Akram Khater
648
3
0
03 Oct 2024
FlashMask: Efficient and Rich Mask Extension of FlashAttention
International Conference on Learning Representations (ICLR), 2024
Guoxia Wang
Jinle Zeng
Xiyuan Xiao
Siming Wu
Jiabin Yang
Lujing Zheng
Zeyu Chen
Jiang Bian
Dianhai Yu
Haifeng Wang
763
12
0
02 Oct 2024
MIO: A Foundation Model on Multimodal Tokens
Zekun Wang
King Zhu
Chunpu Xu
Wangchunshu Zhou
Jiaheng Liu
...
Yuanxing Zhang
Ge Zhang
Ke Xu
Jie Fu
Wenhao Huang
MLLM AuLLM
458
21
0
26 Sep 2024
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
International Conference on Learning Representations (ICLR), 2024
Zuyan Liu
Yuhao Dong
Ziwei Liu
Winston Hu
Jiwen Lu
Yongming Rao
ObjD
605
131
0
19 Sep 2024
Agglomerative Token Clustering
European Conference on Computer Vision (ECCV), 2024
Joakim Bruslund Haurum
Sergio Escalera
Graham W. Taylor
T. Moeslund
287
7
0
18 Sep 2024
Building and better understanding vision-language models: insights and future directions
Hugo Laurençon
Andrés Marafioti
Victor Sanh
Léo Tronchon
VLM
317
132
0
22 Aug 2024
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
International Conference on Learning Representations (ICLR), 2024
Zhuoyi Yang
Jiayan Teng
Wendi Zheng
Ming Ding
Shiyu Huang
...
Weihan Wang
Yean Cheng
Xiaotao Gu
Yuxiao Dong
Jie Tang
DiffM VGen
859
1,293
0
12 Aug 2024
Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning
Mustafa Dogan
İlker Kesen
Iacer Calixto
Aykut Erdem
Erkut Erdem
LRM
252
2
0
17 Jul 2024
ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions
Jiu Feng
Mehmet Hamza Erol
Joon Son Chung
Arda Senocak
213
2
0
11 Jul 2024
The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
Zhen Qin
Daoyuan Chen
Wenhao Zhang
Liuyi Yao
Yilun Huang
Bolin Ding
Yaliang Li
Shuiguang Deng
347
11
0
11 Jul 2024
Study on Aspect Ratio Variability toward Robustness of Vision Transformer-based Vehicle Re-identification
Mei Qiu
Lauren Christopher
Lingxi Li
ViT
163
1
0
10 Jul 2024
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions
Xuan Ju
Yiming Gao
Zhaoyang Zhang
Ziyang Yuan
Xintao Wang
Ailing Zeng
Yu Xiong
Qiang Xu
Ying Shan
VGen
288
103
0
08 Jul 2024
M5 -- A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks
Florian Schneider
Sunayana Sitaram
VLM
253
21
0
04 Jul 2024
Learning to Be a Transformer to Pinpoint Anomalies
Alex Costanzino
Pierluigi Zama Ramirez
Giuseppe Lisanti
Luigi Di Stefano
282
0
0
04 Jul 2024
Data curation via joint example selection further accelerates multimodal learning
Talfan Evans
Nikhil Parthasarathy
Hamza Merzic
Olivier J. Hénaff
301
25
0
25 Jun 2024