WikiVideo: Article Generation from Multiple Videos

Abstract

We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text, and existing methods for video-based summarization target low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for the articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an R1-style reasoning model and a VideoLLM to draw higher-level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while also suggesting intriguing avenues for future work.

@article{martin2025_2504.00939,
  title={WikiVideo: Article Generation from Multiple Videos},
  author={Alexander Martin and Reno Kriz and William Gantt Walden and Kate Sanders and Hannah Recknor and Eugene Yang and Francis Ferraro and Benjamin Van Durme},
  journal={arXiv preprint arXiv:2504.00939},
  year={2025}
}