

Kwai Keye-VL 1.5 Technical Report

1 September 2025
Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Guowang Zhang, Han Shen, Hao Peng, Haojie Ding, Hao Wang, Haonan Fan, Hengrui Ju, J. Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Muhao Wei, Qiang Wang, R. Wang, Sen Na, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, X. Wu, Xiao-Mei Hu, Xingyu Lu, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Z. Lu, Zhenhua Wu, Zhixin Ling, Zhuoran Yang, Z. Li, Di Xu, Haixuan Gao, Xue Yang, Jing Wang, Lejian Ren, Qigen Hu, Qianqian Wang, S. Wang, X. Luo, Yan Li, Yuhang Hu
Topic: VLM
arXiv (abs) · PDF · HTML · HuggingFace (31 upvotes) · GitHub (621★)
Main: 26 pages · 12 figures · 9 tables · Bibliography: 2 pages · Appendix: 5 pages
Abstract

Large Language Models (LLMs) have advanced rapidly in recent years, with Multimodal Large Language Models (MLLMs) extending their capabilities to multimodal tasks. However, video understanding remains challenging due to the dynamic and information-dense nature of videos: existing models struggle with the trade-off between spatial resolution and temporal coverage when processing video content. We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity, processing key frames with significant visual changes at higher resolution (Slow pathway) while handling relatively static frames with increased temporal coverage at lower resolution (Fast pathway). Second, we implement a progressive four-stage pre-training methodology that systematically extends the model's context length from 8K to 128K tokens, enabling the processing of longer videos and more complex visual content. Third, we develop a comprehensive post-training pipeline focused on reasoning enhancement and human preference alignment, incorporating a 5-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning with progressive prompt hinting for difficult cases, and alignment training. Through extensive evaluation on public benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks.
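
The Slow-Fast routing rule described in the abstract is concrete enough to sketch. Below is a minimal, illustrative Python sketch of the frame-routing step, assuming cosine similarity between frame embeddings as the change signal; the threshold (0.9), the resolutions, and the `embed` and `resize` helpers are placeholder assumptions for illustration, not details taken from the report.

```python
import numpy as np

def resize(frame, size):
    """Nearest-neighbor resize; a stand-in for a real resizer (e.g. cv2.resize)."""
    h, w = frame.shape[:2]
    th, tw = size
    rows = np.arange(th) * h // th
    cols = np.arange(tw) * w // tw
    return frame[rows][:, cols]

def route_frames(frames, embed, similarity_threshold=0.9,
                 slow_res=(448, 448), fast_res=(224, 224)):
    """Split video frames into Slow/Fast pathways by inter-frame similarity.

    frames: iterable of H x W x 3 uint8 arrays (decoded video frames).
    embed:  callable mapping a frame to a 1-D feature vector.
    All numeric defaults are illustrative, not the paper's values.
    """
    slow, fast = [], []
    ref_feat = None  # embedding of the most recent Slow (key) frame
    for frame in frames:
        feat = embed(frame)
        if ref_feat is None:
            sim = 0.0  # first frame: always treated as a key frame
        else:
            sim = float(np.dot(feat, ref_feat) /
                        (np.linalg.norm(feat) * np.linalg.norm(ref_feat) + 1e-8))
        if sim < similarity_threshold:
            # Significant visual change: Slow pathway, higher resolution.
            slow.append(resize(frame, slow_res))
            ref_feat = feat  # this frame becomes the new reference
        else:
            # Nearly static frame: Fast pathway, lower resolution,
            # trading spatial detail for temporal coverage.
            fast.append(resize(frame, fast_res))
    return slow, fast
```

Comparing each frame against the last Slow key frame (rather than the immediately preceding frame) is one plausible reading of "inter-frame similarity"; the report's actual criterion may differ.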
