
First Person Action-Object Detection with EgoNet

Abstract

A first-person camera, placed at a person's head, captures that person's visual sensorimotor interactions with objects. Can a single first-person image tell us about momentary visual attention and motor action with objects, without a gaze tracking device or tactile sensors? To study the holistic correlation of visual attention with motor action, we use the concept of action-objects---objects that capture a person's conscious visual (e.g., watching a TV) or tactile (e.g., taking a cup) interactions. Action-objects may be task-dependent, but since many tasks share common person-object spatial configurations, action-objects exhibit a characteristic 3D spatial distance and orientation with respect to the person. Inspired by these observations, we propose to detect action-objects with EgoNet, a joint two-stream RGB and DHG network that holistically integrates visual appearance, head direction, and 3D spatial cues, and that uses a first-person coordinate embedding designed to learn the spatial distribution of action-objects in first-person data. In our experiments, we show that EgoNet consistently outperforms other approaches and that it also generalizes well to previously unseen first-person datasets.
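To make the two-stream idea concrete, the sketch below shows one way such a model could be wired up in PyTorch: an appearance stream over RGB, a spatial stream over DHG-style channels, and a fixed first-person coordinate grid concatenated before a joint prediction head. The layer sizes, fusion scheme, and coordinate-embedding design here are illustrative assumptions, not the published EgoNet architecture.

```python
# Minimal sketch of a two-stream action-object network (assumed design).
import torch
import torch.nn as nn


class TwoStreamActionObjectNet(nn.Module):
    """Fuses an RGB stream, a DHG-style spatial stream, and a fixed
    first-person coordinate grid into per-pixel action-object scores."""

    def __init__(self, dhg_channels: int = 3, feat: int = 32):
        super().__init__()
        # Appearance stream over the RGB frame.
        self.rgb_stream = nn.Sequential(
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        # Spatial stream over DHG-style channels (e.g., depth/height cues).
        self.dhg_stream = nn.Sequential(
            nn.Conv2d(dhg_channels, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        # Joint head: both streams plus a 2-channel (x, y) coordinate embedding.
        self.head = nn.Sequential(
            nn.Conv2d(2 * feat + 2, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, 1, 1),  # per-pixel action-object logit
        )

    @staticmethod
    def coord_grid(b: int, h: int, w: int, device) -> torch.Tensor:
        # Normalized first-person coordinates in [-1, 1]; lets the head learn
        # where action-objects tend to appear relative to the camera wearer.
        ys = torch.linspace(-1, 1, h, device=device)
        xs = torch.linspace(-1, 1, w, device=device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        return torch.stack([gx, gy]).unsqueeze(0).expand(b, -1, -1, -1)

    def forward(self, rgb: torch.Tensor, dhg: torch.Tensor) -> torch.Tensor:
        b, _, h, w = rgb.shape
        fused = torch.cat(
            [self.rgb_stream(rgb), self.dhg_stream(dhg),
             self.coord_grid(b, h, w, rgb.device)], dim=1)
        return self.head(fused)  # (B, 1, H, W) logits


# Toy usage: one 64x64 frame with a 3-channel DHG-style input.
net = TwoStreamActionObjectNet()
scores = net(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
print(scores.shape)  # torch.Size([1, 1, 64, 64])
```

Concatenating a normalized coordinate grid is one simple way to give a convolutional head access to image position, which is what a first-person coordinate embedding is intended to exploit; the actual embedding used by EgoNet may differ.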
