Learning from narrated instruction videos

Abstract

We address the problem of automatically learning the main steps to complete a certain task, such as changing a car tire, from a set of narrated instruction videos. The contributions of this paper are three-fold. First, we develop a joint model for video and natural language narration that takes advantage of the complementary nature of the two signals. Second, we collect an annotated dataset of 57 Internet instruction videos containing more than 350,000 frames for two tasks (changing a car tire and performing cardiopulmonary resuscitation). Third, we experimentally demonstrate that the proposed model automatically discovers, in an unsupervised manner, the main steps to achieve each task and locates them within the input videos. The results further show that the proposed model outperforms single-modality baselines, demonstrating the benefits of jointly modeling video and text.
