2307.06304
Cited By
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
Neural Information Processing Systems (NeurIPS), 2023
12 July 2023
Mostafa Dehghani
Basil Mustafa
Josip Djolonga
Jonathan Heek
Matthias Minderer
Mathilde Caron
Andreas Steiner
J. Puigcerver
Robert Geirhos
Ibrahim Alabdulmohsin
Avital Oliver
Piotr Padlewski
A. Gritsenko
Mario Lučić
N. Houlsby
ViT
Papers citing
"Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution"
50 / 120 papers shown
Kimi-VL Technical Report
Kimi Team
Angang Du
B. Yin
Bowei Xing
Bowen Qu
...
Longxiang Zhang
Zhe Chen
Zijia Zhao
Ziwei Chen
Zongyu Lin
MLLM
VLM
MoE
961
139
0
10 Apr 2025
Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models
Xingguang Ji
Jiakang Wang
Hongzhi Zhang
Jingyuan Zhang
Haonan Zhou
Chenxi Sun
Wenshu Fan
Qi Wang
Fuzheng Zhang
MLLM
VLM
302
1
0
10 Apr 2025
SapiensID: Foundation for Human Recognition
Computer Vision and Pattern Recognition (CVPR), 2025
Minchul Kim
Dingqiang Ye
Yiyang Su
Feng Liu
Xiaoming Liu
CVBM
VLM
290
8
0
07 Apr 2025
Charm: The Missing Piece in ViT fine-tuning for Image Aesthetic Assessment
Computer Vision and Pattern Recognition (CVPR), 2025
Fatemeh Behrad
Tinne Tuytelaars
Johan Wagemans
ViT
299
3
0
03 Apr 2025
UniViTAR: Unified Vision Transformer with Native Resolution
Limeng Qiao
Yiyang Gan
Bairui Wang
Jie Qin
Shuang Xu
Siqi Yang
Lin Ma
475
2
0
02 Apr 2025
Navi-plus: Managing Ambiguous GUI Navigation Tasks with Follow-up Questions
Ziming Cheng
Zhiyuan Huang
Junting Pan
Zhaohui Hou
Mingjie Zhan
384
4
0
31 Mar 2025
Synthetic Video Enhances Physical Fidelity in Video Synthesis
Qi Zhao
Xingyu Ni
Ziyu Wang
Feng Cheng
Ziyan Yang
Lu Jiang
Bohan Wang
VGen
325
9
0
26 Mar 2025
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Yuxiao Chen
L. Meng
Wujian Peng
Zuxuan Wu
Yu-Gang Jiang
VLM
482
5
0
24 Mar 2025
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
Weiming Ren
Wentao Ma
Huan Yang
Cong Wei
Ge Zhang
Lei Ma
Mamba
304
19
0
14 Mar 2025
Long Context Tuning for Video Generation
Yuwei Guo
Ceyuan Yang
Ziyan Yang
Zhibei Ma
Zhijie Lin
Zhenheng Yang
Dahua Lin
Lu Jiang
DiffM
VGen
392
56
0
13 Mar 2025
FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute
Computer Vision and Pattern Recognition (CVPR), 2025
Sotiris Anagnostidis
Gregor Bachmann
Yeongmin Kim
Jonas Kohler
Markos Georgopoulos
A. Sanakoyeu
Yuming Du
Albert Pumarola
Ali K. Thabet
Edgar Schönfeld
393
5
0
27 Feb 2025
M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
Qingpei Guo
Kaiyou Song
Zipeng Feng
Ziping Ma
Qinglong Zhang
...
Yunxiao Sun
Tai-Wei Chang
Jingdong Chen
Ming Yang
Jun Zhou
MLLM
VLM
578
12
0
26 Feb 2025
MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
International Conference on Learning Representations (ICLR), 2025
Jiarui Zhang
Mahyar Khayatkhoei
P. Chhikara
Filip Ilievski
LRM
287
78
0
24 Feb 2025
FeatSharp: Your Vision Model Features, Sharper
Mike Ranzinger
Greg Heinrich
Pavlo Molchanov
Jan Kautz
Bryan Catanzaro
Andrew Tao
CLIP
VLM
401
3
0
22 Feb 2025
PFDiff: Training-Free Acceleration of Diffusion Models Combining Past and Future Scores
International Conference on Learning Representations (ICLR), 2024
Guangyi Wang
Yuren Cai
Lijiang Li
Wei Peng
Songzhi Su
DiffM
306
0
0
21 Feb 2025
Thicker and Quicker: A Jumbo Token for Fast Plain Vision Transformers
A. Fuller
Yousef Yassin
Daniel G. Kyrollos
Evan Shelhamer
James R. Green
479
1
0
20 Feb 2025
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation
Qinghe Wang
Yawen Luo
Xiaoyu Shi
Xu Jia
Huchuan Lu
Tianfan Xue
Xintao Wang
Pengfei Wan
Di Zhang
Kun Gai
DiffM
VGen
403
33
0
12 Feb 2025
Prion-ViT: Prions-Inspired Vision Transformers for Temperature prediction with Specklegrams
Abhishek Sebastian
Pragna R
Sonaa Rajagopal
Muralikrishnan Mani
307
2
0
28 Jan 2025
Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
Neural Information Processing Systems (NeurIPS), 2024
Hadi Pouransari
Chun-Liang Li
Jen-Hao Rick Chang
Pavan Kumar Anasosalu Vasu
Cem Koc
Vaishaal Shankar
Oncel Tuzel
336
23
0
08 Jan 2025
ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
Yuzhou Huang
Ziyang Yuan
Quande Liu
Qiulin Wang
Xintao Wang
Ruimao Zhang
Pengfei Wan
Di Zhang
Kun Gai
VGen
DiffM
407
47
0
08 Jan 2025
Efficient Architectures for High Resolution Vision-Language Models
International Conference on Computational Linguistics (COLING), 2025
Miguel Carvalho
Bruno Martins
MLLM
VLM
199
1
0
05 Jan 2025
Aria-UI: Visual Grounding for GUI Instructions
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yuhao Yang
Yue Wang
Dongxu Li
Ziyang Luo
Bei Chen
Chenyu Huang
Junnan Li
LM&Ro
LLMAG
492
94
0
20 Dec 2024
EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing
Umar Khalid
Kashif Munir
Hasan Iqbal
Nazanin Rahnavard
Jing Hua
Chen Chen
Victor Zhu
284
0
0
13 Dec 2024
LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
Computer Vision and Pattern Recognition (CVPR), 2024
Hongjie Wang
Chih-Yao Ma
Yen-Cheng Liu
Ji Hou
Tao Xu
...
Peizhao Zhang
Tingbo Hou
Peter Vajda
N. Jha
Xiaoliang Dai
LMTD
VGen
VLM
DiffM
423
27
0
13 Dec 2024
Open-Sora Plan: Open-Source Large Video Generation Model
Bin Lin
Yunyang Ge
Xinhua Cheng
Zongjian Li
Bin Zhu
...
Zhang Pan
Xing Zhou
Shaoling Dong
Yonghong Tian
Li-xin Yuan
VLM
VGen
427
191
0
28 Nov 2024
Diffusion Sampling Correction via Approximately 10 Parameters
Guangyi Wang
Wei Peng
Lijiang Li
Wenyu Chen
Yuren Cai
Songzhi Su
DiffM
444
0
0
10 Nov 2024
Adaptive Aspect Ratios with Patch-Mixup-ViT-based Vehicle ReID
Mei Qiu
Lauren Christopher
Stanley Y. P. Chien
Lingxi Li
ViT
199
3
0
09 Nov 2024
Don't Look Twice: Faster Video Transformers with Run-Length Tokenization
Neural Information Processing Systems (NeurIPS), 2024
Rohan Choudhury
Guanglei Zhu
Sihan Liu
Koichiro Niinuma
Kishore Venkateshan
László A. Jeni
248
27
0
07 Nov 2024
Context-Aware Token Selection and Packing for Enhanced Vision Transformer
Tianyi Zhang
B. Li
Jae-sun Seo
Yu Cao
175
1
0
31 Oct 2024
FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model
ZiDong Wang
Zeyu Lu
Di Huang
Cai Zhou
Wanli Ouyang
Lei Bai
285
9
0
17 Oct 2024
Locality Alignment Improves Vision-Language Models
International Conference on Learning Representations (ICLR), 2024
Ian Covert
Tony Sun
James Zou
Tatsunori Hashimoto
VLM
589
11
0
14 Oct 2024
NARAIM: Native Aspect Ratio Autoregressive Image Models
Daniel Gallo Fernández
Robert van der Klis
Răzvan-Andrei Matișan
Janusz Partyka
E. Gavves
Samuele Papa
Phillip Lippe
58
0
0
13 Oct 2024
Pixtral 12B
Pravesh Agrawal
Szymon Antoniak
Emma Bou Hanna
Baptiste Bout
Devendra Singh Chaplot
...
Joachim Studnia
Sandeep Subramanian
Sagar Vaze
Thomas Wang
Sophia Yang
VLM
MLLM
272
112
0
09 Oct 2024
Aria: An Open Multimodal Native Mixture-of-Experts Model
Dongxu Li
Yudong Liu
Haoning Wu
Yue Wang
Zhiqi Shen
...
Lihuan Zhang
Hanshu Yan
Guoyin Wang
Bei Chen
Junnan Li
MoE
492
114
0
08 Oct 2024
Pyramidal Flow Matching for Efficient Video Generative Modeling
International Conference on Learning Representations (ICLR), 2024
Yang Jin
Zhicheng Sun
Ningyuan Li
Kun Xu
K. Xu
...
Nan Zhuang
Quzhe Huang
Yang Song
Yadong Mu
Zhouchen Lin
VGen
513
200
0
08 Oct 2024
HATFormer: Historic Handwritten Arabic Text Recognition with Transformers
Adrian Chan
Anupam Mijar
Mehreen Saeed
Chau-Wai Wong
Akram Khater
648
3
0
03 Oct 2024
FlashMask: Efficient and Rich Mask Extension of FlashAttention
International Conference on Learning Representations (ICLR), 2024
Guoxia Wang
Jinle Zeng
Xiyuan Xiao
Siming Wu
Jiabin Yang
Lujing Zheng
Zeyu Chen
Jiang Bian
Dianhai Yu
Haifeng Wang
763
12
0
02 Oct 2024
MIO: A Foundation Model on Multimodal Tokens
Zekun Wang
King Zhu
Chunpu Xu
Wangchunshu Zhou
Jiaheng Liu
...
Yuanxing Zhang
Ge Zhang
Ke Xu
Jie Fu
Wenhao Huang
MLLM
AuLLM
458
21
0
26 Sep 2024
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
International Conference on Learning Representations (ICLR), 2024
Zuyan Liu
Yuhao Dong
Ziwei Liu
Winston Hu
Jiwen Lu
Yongming Rao
ObjD
605
131
0
19 Sep 2024
Agglomerative Token Clustering
European Conference on Computer Vision (ECCV), 2024
Joakim Bruslund Haurum
Sergio Escalera
Graham W. Taylor
T. Moeslund
287
7
0
18 Sep 2024
Building and better understanding vision-language models: insights and future directions
Hugo Laurençon
Andrés Marafioti
Victor Sanh
Léo Tronchon
VLM
317
132
0
22 Aug 2024
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
International Conference on Learning Representations (ICLR), 2024
Zhuoyi Yang
Jiayan Teng
Wendi Zheng
Ming Ding
Shiyu Huang
...
Weihan Wang
Yean Cheng
Xiaotao Gu
Yuxiao Dong
Jie Tang
DiffM
VGen
859
1,293
0
12 Aug 2024
Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning
Mustafa Dogan
İlker Kesen
Iacer Calixto
Aykut Erdem
Erkut Erdem
LRM
252
2
0
17 Jul 2024
ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions
Jiu Feng
Mehmet Hamza Erol
Joon Son Chung
Arda Senocak
213
2
0
11 Jul 2024
The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
Zhen Qin
Daoyuan Chen
Wenhao Zhang
Liuyi Yao
Yilun Huang
Bolin Ding
Yaliang Li
Shuiguang Deng
347
11
0
11 Jul 2024
Study on Aspect Ratio Variability toward Robustness of Vision Transformer-based Vehicle Re-identification
Mei Qiu
Lauren Christopher
Lingxi Li
ViT
163
1
0
10 Jul 2024
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions
Xuan Ju
Yiming Gao
Zhaoyang Zhang
Ziyang Yuan
Xintao Wang
Ailing Zeng
Yu Xiong
Qiang Xu
Ying Shan
VGen
288
103
0
08 Jul 2024
M5 -- A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks
Florian Schneider
Sunayana Sitaram
VLM
253
21
0
04 Jul 2024
Learning to Be a Transformer to Pinpoint Anomalies
Alex Costanzino
Pierluigi Zama Ramirez
Giuseppe Lisanti
Luigi Di Stefano
282
0
0
04 Jul 2024
Data curation via joint example selection further accelerates multimodal learning
Talfan Evans
Nikhil Parthasarathy
Hamza Merzic
Olivier J. Hénaff
301
25
0
25 Jun 2024