CogVLM2: Visual Language Models for Image and Video Understanding

29 August 2024
Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang
VLM, MLLM
Abstract

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video, and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both the pre-training and post-training stages, supporting input resolutions up to 1344×1344 pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks such as MMBench, MM-Vet, TextVQA, MVBench, and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.
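For readers who want to try the released checkpoints, the sketch below shows one way the open-sourced image model could be loaded with Hugging Face Transformers. The model ID, dtype, and device settings are illustrative assumptions, not the authors' official example; the exact inference procedure is documented in the CogVLM2 repository linked above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face model ID for a CogVLM2 image-chat variant;
# see https://github.com/THUDM/CogVLM2 for the official list of checkpoints.
MODEL_ID = "THUDM/cogvlm2-llama3-chat-19B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half precision to keep memory use manageable
    trust_remote_code=True,      # CogVLM2 ships custom modeling code with the weights
    device_map="auto",
).eval()

# Image and video prompting relies on model-specific helpers defined in the
# repository's example scripts; consult the README for the exact conversation format.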
