
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

GLM-V Team
Wenyi Hong
Wenmeng Yu
Xiaotao Gu
Guo Wang
Guobing Gan
Haomiao Tang
Jiale Cheng
Ji Qi
Junhui Ji
Lihang Pan
Shuaiqi Duan
Weihan Wang
Yan Wang
Yean Cheng
Zehai He
Zhe Su
Zhen Yang
Ziyang Pan
Aohan Zeng
Baoxu Wang
Boyan Shi
Changyu Pang
Chenhui Zhang
Da Yin
Fan Yang
Guoqing Chen
Jiazheng Xu
Jiali Chen
Jing Chen
Jinhao Chen
Jinghao Lin
Jinjiang Wang
Junjie Chen
Leqi Lei
Letian Gong
Leyi Pan
Mingzhi Zhang
Qinkai Zheng
Sheng Yang
Shi Zhong
Shiyu Huang
Shuyuan Zhao
Siyan Xue
Shangqin Tu
Shengbiao Meng
Tianshu Zhang
Tianwei Luo
Tianxiang Hao
Wenkai Li
Wei Jia
Xin Lyu
Xuancheng Huang
Yanling Wang
Yadong Xue
Yanfeng Wang
Yifan An
Yifan Du
Yiming Shi
Yiheng Huang
Yilin Niu
Yuan Wang
Yuanchang Yue
Yuchen Li
Yutao Zhang
Yuxuan Zhang
Zhanxiao Du
Zhenyu Hou
Zhao Xue
Zhengxiao Du
Zihan Wang
Peng Zhang
Debing Liu
Bin Xu
Juanzi Li
Minlie Huang
Yuxiao Dong
Jie Tang
Main: 18 pages; Bibliography: 3 pages; Appendix: 14 pages; 17 figures, 2 tables
Abstract

We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding. We open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities. Code, models, and more information are released at this https URL.
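
The abstract names Reinforcement Learning with Curriculum Sampling (RLCS) without detailing it here. As a rough illustration only of the general idea (biasing RL training batches toward samples of informative difficulty), the Python sketch below uses a rollout pass rate as a difficulty proxy; all function names, the weighting scheme, and the data layout are assumptions for illustration and are not taken from the paper.

```python
# Illustrative sketch only: a generic curriculum-sampling loop for RL fine-tuning.
# None of these names come from the paper; using "pass_rate" as a difficulty proxy
# and weighting toward mid-difficulty items are assumptions, not the authors' RLCS recipe.
import random

def difficulty_weight(pass_rate: float) -> float:
    """Favor items the policy solves sometimes but not always (most informative)."""
    return pass_rate * (1.0 - pass_rate) + 1e-3  # small floor keeps every item sampleable

def sample_batch(pool: list[dict], batch_size: int) -> list[dict]:
    """Draw a training batch with probability proportional to curriculum weight."""
    weights = [difficulty_weight(item["pass_rate"]) for item in pool]
    return random.choices(pool, weights=weights, k=batch_size)

def update_pass_rate(item: dict, rewards: list[float], momentum: float = 0.9) -> None:
    """Maintain a running estimate of how often rollouts on this item succeed."""
    batch_rate = sum(r > 0 for r in rewards) / max(len(rewards), 1)
    item["pass_rate"] = momentum * item["pass_rate"] + (1.0 - momentum) * batch_rate

# Toy usage: a pool of prompts with initial difficulty estimates.
pool = [{"prompt": f"task-{i}", "pass_rate": random.random()} for i in range(1000)]
batch = sample_batch(pool, batch_size=32)
# ... run rollouts / policy updates on `batch`, then refresh difficulty estimates:
for item in batch:
    update_pass_rate(item, rewards=[random.choice([0.0, 1.0]) for _ in range(8)])
```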

@article{team2025_2507.01006,
  title={GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning},
  author={ GLM-V Team and Wenyi Hong and Wenmeng Yu and Xiaotao Gu and Guo Wang and Guobing Gan and Haomiao Tang and Jiale Cheng and Ji Qi and Junhui Ji and Lihang Pan and Shuaiqi Duan and Weihan Wang and Yan Wang and Yean Cheng and Zehai He and Zhe Su and Zhen Yang and Ziyang Pan and Aohan Zeng and Baoxu Wang and Boyan Shi and Changyu Pang and Chenhui Zhang and Da Yin and Fan Yang and Guoqing Chen and Jiazheng Xu and Jiali Chen and Jing Chen and Jinhao Chen and Jinghao Lin and Jinjiang Wang and Junjie Chen and Leqi Lei and Letian Gong and Leyi Pan and Mingzhi Zhang and Qinkai Zheng and Sheng Yang and Shi Zhong and Shiyu Huang and Shuyuan Zhao and Siyan Xue and Shangqin Tu and Shengbiao Meng and Tianshu Zhang and Tianwei Luo and Tianxiang Hao and Wenkai Li and Wei Jia and Xin Lyu and Xuancheng Huang and Yanling Wang and Yadong Xue and Yanfeng Wang and Yifan An and Yifan Du and Yiming Shi and Yiheng Huang and Yilin Niu and Yuan Wang and Yuanchang Yue and Yuchen Li and Yutao Zhang and Yuxuan Zhang and Zhanxiao Du and Zhenyu Hou and Zhao Xue and Zhengxiao Du and Zihan Wang and Peng Zhang and Debing Liu and Bin Xu and Juanzi Li and Minlie Huang and Yuxiao Dong and Jie Tang },
  journal={arXiv preprint arXiv:2507.01006},
  year={2025}
}