The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

14 April 2025
Weixian Lei
Jiacong Wang
Haochen Wang
Xuelong Li
Jun Hao Liew
Jiashi Feng
Zilong Huang
ArXiv (abs) | PDF | HTML | HuggingFace (15 upvotes)

Papers citing "The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer"

16 / 16 papers shown
Title
Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning
Xin Gu
H. Zhang
Qihang Fan
Jingxuan Niu
Zhipeng Zhang
Libo Zhang
G. Chen
Fan Chen
Longyin Wen
Sijie Zhu
AI4TS, LRM
271
0
0
26 Nov 2025
Positional Preservation Embedding for Multimodal Large Language Models
Mouxiao Huang
Borui Jiang
Dehua Zheng
Hailin Hu
Kai Han
Xinghao Chen
VLM
233
0
0
27 Oct 2025
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Haochen Wang
Yuhao Wang
Tao Zhang
Yikang Zhou
Yanwei Li
...
Anran Wang
Yunhai Tong
Z. Wang
X. Li
Zhaoxiang Zhang
VLM
169
0
0
21 Oct 2025
RL makes MLLMs see better than SFT
Junha Song
Sangdoo Yun
Dongyoon Han
Jaegul Choo
Byeongho Heo
OffRL
147
0
0
18 Oct 2025
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
Haiwen Diao
Mingxuan Li
Silei Wu
Linjun Dai
Xiaohua Wang
Hanming Deng
Lewei Lu
Dahua Lin
Ziwei Liu
VLM
112
0
0
16 Oct 2025
SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model
Lin Lin
Jiefeng Long
Zhihe Wan
Y. Wang
Dingkang Yang
...
Yan Qiu
Haiyang Yu
Xiao Liang
Hongsheng Li
Chao Feng
163
3
0
14 Oct 2025
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
Changyao Tian
Hao Li
Gen Luo
Xizhou Zhu
Weijie Su
...
Y. Liu
Lewei Lu
Wenhai Wang
Hongsheng Li
Jifeng Dai
89
1
0
09 Oct 2025
SVAC: Scaling Is All You Need For Referring Video Object Segmentation
Li Zhang
Haoxiang Gao
Zhihao Zhang
Luoxiao Huang
Tao Zhang
VOS
101
0
0
28 Sep 2025
OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation
Han Li
Xinyu Peng
Y. Wang
Zelin Peng
Xin Chen
Rongxiang Weng
Jingang Wang
Xunliang Cai
Wenrui Dai
Hongkai Xiong
MLLM, OffRL
270
9
0
03 Sep 2025
Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models
Chengcheng Wang
Jianyuan Guo
Hongguang Li
Yuchuan Tian
Ying Nie
Chang Xu
Kai Han
229
3
0
22 May 2025
Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training
Yiran Chen
Yuan Yao
Tong Zhang
Heng Ji
VLM
279
1
0
13 May 2025
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
Tao Zhang
Xuelong Li
Zilong Huang
Yuchen Ren
Weixian Lei
XueQing Deng
Shihao Chen
Shilin Xu
Jiashi Feng
MLLM, LRM
283
17
0
14 Apr 2025
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data
Shuhao Gu
Jialing Zhang
Siyuan Zhou
Kevin Yu
Zhaohu Xing
...
Yufeng Cui
Xinlong Wang
Yaoqi Liu
Fangxiang Feng
Guang Liu
SyDa, VLM, MLLM
335
51
0
24 Oct 2024
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Computer Vision and Pattern Recognition (CVPR), 2024
Gen Luo
Xue Yang
Wenhan Dou
Zhaokai Wang
Jifeng Dai
Yu Qiao
Xizhou Zhu
VLM, MLLM
321
64
0
10 Oct 2024
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
Haodong Duan
Xinyu Fang
Junming Yang
Xiangyu Zhao
Lin Chen
...
Yuhang Zang
Pan Zhang
Jiaqi Wang
Dahua Lin
Kai Chen
LM&MA, VLM
668
334
0
16 Jul 2024
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team
MLLM
456
588
0
16 May 2024