Video In-context Learning: Autoregressive Transformers are Zero-Shot Video Imitators

10 July 2024
Wentao Zhang, Junliang Guo, Tianyu He, Li Zhao, Linli Xu, Jiang Bian
Abstract

People interact with the real world largely through visual signals, which are ubiquitous and provide detailed demonstrations. In this paper, we explore using visual signals as a new interface for models to interact with the environment. Specifically, we choose videos as a representative visual signal. By training autoregressive Transformers on video datasets with a self-supervised objective, we find that the model develops an emergent zero-shot capability to infer the semantics of a demonstration video and imitate those semantics in an unseen scenario. This allows the model to perform unseen tasks by watching a demonstration video in an in-context manner, without further fine-tuning. To validate this imitation capacity, we design various evaluation metrics, including both objective and subjective measures. The results show that our models can generate high-quality video clips that accurately align with the semantic guidance provided by the demonstration videos, and we also show that the imitation capacity follows a scaling law. Code and models have been open-sourced.
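To make the in-context setup concrete, the sketch below illustrates the general idea of video in-context learning with a decoder-only autoregressive Transformer: a demonstration clip and a query prefix are flattened into one discrete token sequence, and the model autoregressively generates the query's continuation so that it follows the demonstrated semantics. This is a minimal, hypothetical illustration, not the authors' implementation; the tokenizer, vocabulary size, tokens-per-frame, and model dimensions are all assumptions made for the example.

```python
# Minimal sketch of video in-context learning with an autoregressive Transformer.
# Illustrative assumptions: a discrete visual codebook of size 1024 (e.g. from a
# VQ-style tokenizer), 64 tokens per frame, and a tiny 4-layer decoder.
import torch
import torch.nn as nn

VOCAB_SIZE = 1024      # assumed codebook size of the visual tokenizer
TOKENS_PER_FRAME = 64  # assumed number of discrete codes per frame
D_MODEL = 256

class TinyVideoAR(nn.Module):
    """Decoder-only Transformer over flattened video tokens (illustrative only)."""
    def __init__(self, max_len=2048):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = nn.Embedding(max_len, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            D_MODEL, nhead=4, dim_feedforward=512, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask: each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)

@torch.no_grad()
def imitate(model, demo_tokens, query_tokens, frames_to_generate=4):
    """Zero-shot imitation: condition on [demonstration ; query prefix] and
    autoregressively sample the continuation of the query clip."""
    seq = torch.cat([demo_tokens, query_tokens], dim=1)   # (1, T)
    for _ in range(frames_to_generate * TOKENS_PER_FRAME):
        logits = model(seq)[:, -1]                  # next-token distribution
        next_tok = logits.argmax(-1, keepdim=True)  # greedy decoding for simplicity
        seq = torch.cat([seq, next_tok], dim=1)
    return seq[:, demo_tokens.size(1):]             # query prefix + generated frames

# Usage with random placeholder tokens standing in for tokenized video frames.
model = TinyVideoAR()
demo = torch.randint(0, VOCAB_SIZE, (1, 8 * TOKENS_PER_FRAME))   # 8-frame demonstration
query = torch.randint(0, VOCAB_SIZE, (1, 2 * TOKENS_PER_FRAME))  # 2-frame query prefix
out = imitate(model, demo, query)
print(out.shape)  # (1, 2*64 + 4*64): query prefix plus 4 generated frames
```

In this framing, no task-specific fine-tuning is involved: the demonstration clip plays the same role as a text prompt in language-model in-context learning, and the semantics to imitate are inferred purely from the conditioning tokens.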

@article{zhang2025_2407.07356,
  title={Video In-context Learning: Autoregressive Transformers are Zero-Shot Video Imitators},
  author={Wentao Zhang and Junliang Guo and Tianyu He and Li Zhao and Linli Xu and Jiang Bian},
  journal={arXiv preprint arXiv:2407.07356},
  year={2025}
}