Valley: Video Assistant with Large Language model Enhanced abilitY

Abstract

Large Language Models (LLMs), with their remarkable conversational capability, have emerged as AI assistants that can handle both visual and textual modalities. However, their effectiveness in joint video and language understanding has not been extensively explored. In this paper, we introduce Valley, a multi-modal foundation model designed to enable enhanced video comprehension and instruction-following capabilities. To this end, we construct two datasets, namely Valley-702k and Valley-instruct-73k, covering a diverse range of video-text alignment and video-based instruction tasks, such as multi-shot captioning, long video description, action recognition, and causal inference. We then adopt ViT-L/14 as the vision encoder and explore three different temporal modeling modules to learn multifaceted features for enhanced video understanding. In addition, we implement a two-phase training approach for Valley: the first phase trains only the projection module so the LLM learns to understand visual input, and the second phase jointly trains the projection module and the LLM to improve their instruction-following ability. Extensive experiments demonstrate that Valley has the potential to serve as an effective video assistant, simplifying complex video-understanding scenarios. Our code and data are published anonymously at this https URL.
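
To make the architecture concrete, below is a minimal sketch of the kind of pipeline the abstract describes: per-frame features from a frozen ViT-L/14 are aggregated by a temporal modeling module and then mapped by a projection module into the LLM embedding space. The module name, dimensions, and mean-pooling strategy are illustrative assumptions, not the paper's actual design of its three temporal modules.

```python
# Hypothetical sketch of temporal aggregation + projection into the LLM embedding space.
# All names and hyperparameters here are assumptions for illustration only.
import torch
import torch.nn as nn


class TemporalProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Projection module mapping visual features into the LLM token-embedding space.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, num_frames, num_patches, vision_dim) from a frozen ViT-L/14.
        # Simple temporal modeling: average patch tokens per frame, then pool over time.
        frame_tokens = frame_features.mean(dim=2)              # (batch, num_frames, vision_dim)
        video_token = frame_tokens.mean(dim=1, keepdim=True)   # (batch, 1, vision_dim)
        # Concatenate per-frame tokens with a global video token, then project.
        visual_tokens = torch.cat([frame_tokens, video_token], dim=1)
        return self.proj(visual_tokens)                        # (batch, num_frames + 1, llm_dim)


if __name__ == "__main__":
    feats = torch.randn(2, 8, 256, 1024)  # 2 videos, 8 frames, 256 patches, ViT-L/14 width
    tokens = TemporalProjector()(feats)
    print(tokens.shape)  # torch.Size([2, 9, 4096])
```

In the two-phase training scheme described above, only a module like `proj` would be updated in the first phase, with the LLM unfrozen and trained jointly in the second phase.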

@article{luo2025_2306.07207,
  title={Valley: Video Assistant with Large Language model Enhanced abilitY},
  author={Ruipu Luo and Ziwang Zhao and Min Yang and Zheming Yang and Minghui Qiu and Tao Wang and Zhongyu Wei and Yanhao Wang and Cen Chen},
  journal={arXiv preprint arXiv:2306.07207},
  year={2025}
}