Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution


Neural Information Processing Systems (NeurIPS), 2023
12 July 2023
Mostafa Dehghani
Basil Mustafa
Josip Djolonga
Jonathan Heek
Matthias Minderer
Mathilde Caron
Andreas Steiner
J. Puigcerver
Robert Geirhos
Ibrahim Alabdulmohsin
Avital Oliver
Piotr Padlewski
A. Gritsenko
Mario Lučić
N. Houlsby
    ViT

Papers citing "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution"

50 / 120 papers shown
Jina-VLM: Small Multilingual Vision Language Model
Andreas Koukounas
Georgios Mastrapas
Florian Hönicke
Sedigheh Eslami
Guillaume Roncari
Scott Martens
Han Xiao
MLLM
336
0
0
03 Dec 2025
Spatiotemporal Pyramid Flow Matching for Climate Emulation
Jeremy Irvin
Jiaqi Han
Z. Wang
Abdulaziz Alharbi
Yufei Zhao
Nomin-Erdene Bayarsaikhan
Daniele Visioni
A. Ng
Duncan Watson-Parris
AI4TS
84
0
0
01 Dec 2025
ReasonEdit: Towards Reasoning-Enhanced Image Editing Models
Fukun Yin
Shiyu Liu
Yucheng Han
Zhibo Wang
Peng Xing
...
Pengtao Chen
Xiangyu Zhang
Daxin Jiang
Xianfang Zeng
Gang Yu
DiffM, KELM, LRM
237
0
0
27 Nov 2025
Re-Key-Free, Risky-Free: Adaptable Model Usage Control
Zihan Wang
Zhongkui Ma
Xinguo Feng
Chuan Yan
Dongge Liu
Ruoxi Sun
Derui Wang
Minhui Xue
Guangdong Bai
AAML
165
0
0
24 Nov 2025
Striking the Right Balance between Compute and Copy: Improving LLM Inferencing Under Speculative Decoding
Arun Ramachandran
Ramaswamy Govindarajan
M. Annavaram
Prakash Raghavendra
Hossein Entezari Zarch
Lei Gao
Chaoyi Jiang
148
0
0
15 Nov 2025
Application of Graph Based Vision Transformers Architectures for Accurate Temperature Prediction in Fiber Specklegram Sensors
Abhishek Sebastian
141
0
0
15 Nov 2025
LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
Zeyu Wang
Z. Chen
Chenhui Gou
Feng Li
Chaorui Deng
...
Kunchang Li
Weihao Yu
Haoqin Tu
Haoqi Fan
Cihang Xie
359
0
0
27 Oct 2025
CARES: Context-Aware Resolution Selector for VLMs
Moshe Kimhi
Nimrod Shabtay
Raja Giryes
Chaim Baskin
Eli Schwartz
VLM
120
0
0
22 Oct 2025
SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models
Gyubeum Lim
Yemo Koo
Vijay Krishna Madisetti
100
0
0
22 Oct 2025
DeepSeek-OCR: Contexts Optical Compression
Haoran Wei
Yaofeng Sun
Yukun Li
VLM
232
25
0
21 Oct 2025
Accelerating Vision Transformers with Adaptive Patch Sizes
Rohan Choudhury
JungEun Kim
Jeongseok Lee
Eunho Yang
László A. Jeni
Kishore Venkateshan
ViT
116
1
0
20 Oct 2025
StretchySnake: Flexible SSM Training Unlocks Action Recognition Across Spatio-Temporal Scales
Nyle Siddiqui
Rohit Gupta
S. Swetha
Mubarak Shah
152
0
0
17 Oct 2025
Task-Aware Resolution Optimization for Visual Large Language Models
Weiqing Luo
Zhen Tan
Y. Li
Xinyu Zhao
Kwonjoon Lee
Behzad Dariush
Tianlong Chen
76
0
0
10 Oct 2025
UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution
Shian Du
Menghan Xia
Chang-rui Liu
Quande Liu
Xintao Wang
Pengfei Wan
Xiangyang Ji
VGen, SupR
272
0
0
09 Oct 2025
PatchVSR: Breaking Video Diffusion Resolution Limits with Patch-wise Video Super-Resolution
Computer Vision and Pattern Recognition (CVPR), 2025
S. Du
Menghan Xia
Chang Liu
Xintao Wang
Jing Wang
Pengfei Wan
Di Zhang
Xiangyang Ji
DiffM, SupR, VGen
287
3
0
30 Sep 2025
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
Chengyao Wang
Zhisheng Zhong
Bohao Peng
Senqiao Yang
Yuqi Liu
Haokun Gui
Bin Xia
Jingyao Li
Bei Yu
Jiaya Jia
MLLM, AuLLM, VLM
159
2
0
29 Sep 2025
DentVLM: A Multimodal Vision-Language Model for Comprehensive Dental Diagnosis and Enhanced Clinical Practice
Zijie Meng
Jin Hao
Xiwei Dai
Yang Feng
Jiaxiang Liu
...
Lunguo Xia
B. Fang
Jimeng Sun
Jian Wu
Zuozhu Liu
LM&MA
122
4
0
27 Sep 2025
Multilingual Vision-Language Models, A Survey
Andrei-Alexandru Manea
Jindřich Libovický
VLM
143
1
0
26 Sep 2025
Revisiting Data Challenges of Computational Pathology: A Pack-based Multiple Instance Learning Training Framework
Wenhao Tang
Heng Fang
Ge Wu
Xiang Li
Ming-Ming Cheng
191
0
0
25 Sep 2025
PMRT: A Training Recipe for Fast, 3D High-Resolution Aerodynamic Prediction
Sam Jacob Jacob
Markus Mrosek
C. Othmer
Harald Köstler
DiffM, AI4CE
131
0
0
21 Sep 2025
Lynx: Towards High-Fidelity Personalized Video Generation
S. Sang
Tiancheng Zhi
Tianpei Gu
Jing Liu
Linjie Luo
DiffM, VGen
208
3
0
19 Sep 2025
Qianfan-VL: Domain-Enhanced Universal Vision-Language Models
Daxiang Dong
Mingming Zheng
Dong Xu
Bairong Zhuang
W. Zhang
...
Ruchang Yao
Ziye Yuan
J. Wu
Guangjun Xie
Dou Shen
VLM
95
1
0
19 Sep 2025
AToken: A Unified Tokenizer for Vision
Jiasen Lu
Liangchen Song
Mingze Xu
Byeongjoo Ahn
Yanjun Wang
Chen Chen
Afshin Dehghan
Yinfei Yang
ViT
236
7
0
17 Sep 2025
MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs
Feilong Chen
Y. Liu
Yi Huang
Hao Wang
Miren Tian
Ya-Qi Yu
Minghui Liao
Jihao Wu
MLLM, VLM
317
1
0
15 Sep 2025
Reconstruction Alignment Improves Unified Multimodal Models
Ji Xie
Trevor Darrell
Luke Zettlemoyer
Xudong Wang
214
15
0
08 Sep 2025
Kwai Keye-VL 1.5 Technical Report
Biao Yang
Bin Wen
Boyang Ding
Changyi Liu
Chenglong Chu
...
S. Wang
X. Luo
Yan Li
Yuhang Hu
Zixing Zhang
VLM
325
15
0
01 Sep 2025
POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion
Yuan Liu
Zhongyin Zhao
Le Tian
Haicheng Wang
Xubing Ye
...
Zilin Yu
Chuhan Wu
Xiao-bin Zhou
Yang Yu
Jie Zhou
VLM
164
3
0
01 Sep 2025
How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding
Zhuoran Yu
Yong Jae Lee
LRM
96
2
0
27 Aug 2025
Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
Luozheng Qin
Jia Gong
Yuqing Sun
Tianjiao Li
Mengping Yang
Xiaomeng Yang
Chao Qu
Zhiyu Tan
Hao Li
MLLM, LRM
213
0
0
07 Aug 2025
Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off
Seungyong Lee
Jeong-gi Kwak
DiffM
237
1
0
06 Aug 2025
Few-Shot Vision-Language Reasoning for Satellite Imagery via Verifiable Rewards
Aybora Koksal
A. Aydin Alatan
OffRL, LRM
169
1
0
29 Jul 2025
ZERO: Industry-ready Vision Foundation Model with Multi-modal Prompts
Sangbum Choi
Kyeongryeol Go
Taewoong Jang
ObjD, VLM
211
0
0
06 Jul 2025
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
Gaojie Lin
Jianwen Jiang
Jiaqi Yang
Zerong Zheng
Chao Liang
DiffM, VGen
1.3K
84
0
01 Jul 2025
SeedEdit 3.0: Fast and High-Quality Generative Image Editing
Peng Wang
Yichun Shi
Xiaochen Lian
Zhonghua Zhai
Xin Xia
Xuefeng Xiao
Weilin Huang
Jianchao Yang
411
26
0
05 Jun 2025
Native-Resolution Image Synthesis
Zidong Wang
Mengwei He
Xiangyu Yue
Xuming He
Yiyuan Zhang
308
3
0
03 Jun 2025
Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks
Tao Yang
Ruibin Li
Yangming Shi
Yuqi Zhang
Qide Dong
Haoran Cheng
Weiguo Feng
Shilei Wen
Bingyue Peng
Lei Zhang
DiffM, VGen
264
0
0
02 Jun 2025
EffiVLM-BENCH: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Vision-Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zekun Wang
Minghua Ma
Zexin Wang
Rongchuan Mu
Liping Shan
Ming Liu
Bing Qin
VLM
193
4
0
31 May 2025
Frame In-N-Out: Unbounded Controllable Image-to-Video Generation
Boyang Wang
Xuweiyi Chen
Matheus Gadelha
Zezhou Cheng
DiffM, VGen
372
5
0
27 May 2025
Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities
Ziwei Zhou
Rui Wang
Zuxuan Wu
AuLLM, VGen
196
20
0
23 May 2025
TinyRS-R1: Compact Multimodal Language Model for Remote Sensing
IEEE Geoscience and Remote Sensing Letters (GRSL), 2025
Aybora Koksal
A. Aydin Alatan
LRM
263
1
0
17 May 2025
SAMChat: Introducing Chain of Thought Reasoning and GRPO to a Multimodal Small Language Model for Small Scale Remote Sensing
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (IEEE J-STARS), 2025
Aybora Koksal
A. Aydin Alatan
LRM
283
5
0
12 May 2025
CM1 - A Dataset for Evaluating Few-Shot Information Extraction with Large Vision Language Models
IEEE International Conference on Document Analysis and Recognition (ICDAR), 2025
Fabian Wolf
Oliver Tüselmann
Arthur Matei
Lukas Hennies
Christoph Rass
Gernot A. Fink
280
1
0
07 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Wei Wei
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
...
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
1.1K
30
0
05 May 2025
Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
Biao Gong
Cheng Zou
Dandan Zheng
Hu Yu
Jingdong Chen
...
Qingpei Guo
Rui Liu
Weilong Chai
Xinyu Xiao
Ziyuan Huang
MLLM
563
10
0
05 May 2025
OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding
Songtao Jiang
Yuan Wang
Sibo Song
Yanzhe Zhang
Zijie Meng
Bohan Lei
Jian Wu
Jimeng Sun
Zuozhu Liu
MedIm, VLM
251
11
0
20 Apr 2025
How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos?
Rahul Thapa
Andrew Li
Qingyang Wu
Bryan He
Yuki Sahashi
...
Angela Zhang
Ben Athiwaratkun
Shuaiwen Leon Song
David Ouyang
James Zou
LM&MA
484
3
0
19 Apr 2025
OVERLORD: Ultimate Scaling of DataLoader for Multi-Source Large Foundation Model Training
Juntao Zhao
Qi Lu
Wei Jia
Borui Wan
Lei Zuo
...
Size Zheng
Yanghua Peng
H. Lin
Xin Liu
Chuan Wu
AI4CE
355
1
0
14 Apr 2025
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
Team Seawead
Ceyuan Yang
Zhijie Lin
Yang Zhao
Shanchuan Lin
...
Zuquan Song
Zhenheng Yang
Jiashi Feng
Jianchao Yang
Lu Jiang
DiffM
571
62
0
11 Apr 2025
Kimi-VL Technical Report
Kimi Team
Angang Du
B. Yin
Bowei Xing
Bowen Qu
...
Longxiang Zhang
Zhe Chen
Zijia Zhao
Ziwei Chen
Zongyu Lin
MLLM, VLM, MoE
961
139
0
10 Apr 2025
Data Metabolism: An Efficient Data Design Schema For Vision Language Model
Jingyuan Zhang
Hongzhi Zhang
Zhou Haonan
Chenxi Sun
Xingguang Ji
Jiakang Wang
Fanheng Kong
Wenshu Fan
Qi Wang
Fuzheng Zhang
VLM
381
2
0
10 Apr 2025