
Learning to Generate Long-term Future Narrations Describing Activities of Daily Living

Abstract

Anticipating future events is crucial for application domains such as healthcare, smart home technology, and surveillance. Narrative event descriptions provide context-rich information, enhancing a system's future planning and decision-making capabilities. We propose a novel task, long-term future narration generation, which extends beyond traditional action anticipation by generating detailed narrations of future daily activities. We introduce a visual-language model, ViNa, specifically designed to address this challenging task. ViNa integrates long-term videos and their corresponding narrations to generate a sequence of future narrations that predict subsequent events and actions over extended time horizons. ViNa goes beyond existing multimodal models, which either make only short-term predictions or describe observed videos, by generating long-term future narrations for a broader range of daily activities. We also present a novel downstream application, future video retrieval, which leverages the generated narrations to help users plan a task by visualizing the future. We evaluate future narration generation on the largest egocentric dataset, Ego4D.
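To make the task's input-output format concrete, here is a minimal sketch in Python. It assumes a hypothetical interface (Observation, generate_future_narrations) that is not part of the paper or any released code; it only illustrates that the model conditions on the observed video and its past narrations and outputs a time-ordered sequence of future narrations.

# Hypothetical sketch of the long-term future narration task interface.
# Names below are illustrative only and do not correspond to the authors' implementation.
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    video_frames: List[bytes]    # encoded frames from the observed egocentric clip
    past_narrations: List[str]   # time-ordered narrations of what has happened so far

def generate_future_narrations(obs: Observation, horizon: int) -> List[str]:
    """Return `horizon` narrations predicting what the camera wearer does next.

    A real system would condition a visual-language model on both modalities;
    this placeholder only shows the expected input and output types.
    """
    return [f"predicted step {i + 1}" for i in range(horizon)]

if __name__ == "__main__":
    obs = Observation(
        video_frames=[],
        past_narrations=["#C C picks up a knife", "#C C chops the onion"],
    )
    print(generate_future_narrations(obs, horizon=3))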

@article{rajendiran2025_2503.01416,
  title={Learning to Generate Long-term Future Narrations Describing Activities of Daily Living},
  author={Ramanathan Rajendiran and Debaditya Roy and Basura Fernando},
  journal={arXiv preprint arXiv:2503.01416},
  year={2025}
}