Efficient Segmental Inference for Spatiotemporal Modeling of
Fine-grained Actions
Joint segmentation and classification of fine-grained actions is important for applications in human-robot interaction, video surveillance, and human skill evaluation. In the first part of this paper we develop a spatiotemporal model that takes inspiration from early work in robot task modeling by learning how the state of the world changes over time. The spatial component models objects, locations, and object relationships, and the temporal component models how actions transition throughout a sequence. In the second part of the paper we introduce an efficient algorithm for segmental inference that jointly segments and predicts all actions within a video. Our algorithm is tens to thousands of times faster than the current method on two fine-grained action recognition datasets. We highlight the effectiveness of our approach on the 50 Salads and JIGSAWS datasets, observe that our model produces far fewer false positives than competing methods, and achieve state-of-the-art performance.
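The segmental inference described above is commonly formulated as a semi-Markov (segmental) Viterbi dynamic program that scores every candidate segment and transition, then backtraces the best joint segmentation and labeling. The sketch below is only an illustration of that general technique, not the paper's specific algorithm; the function name, the per-frame score and transition matrices, and the maximum-duration cap are all assumptions for the example.

```python
import numpy as np

def segmental_viterbi(scores, trans, max_dur):
    """Illustrative semi-Markov Viterbi decoding (not the paper's exact method).

    scores:  (T, C) per-frame class scores
    trans:   (C, C) action-to-action transition scores
             (set the diagonal very negative to forbid self-transitions)
    max_dur: maximum allowed segment duration
    Returns a length-T array of per-frame labels that jointly
    segments and classifies the sequence.
    """
    T, C = scores.shape
    # Prefix sums so any segment's score is a difference of two rows.
    cum = np.vstack([np.zeros((1, C)), np.cumsum(scores, axis=0)])
    V = np.full((T + 1, C), -np.inf)          # best parse ending at t with class c
    V[0] = 0.0
    back = np.zeros((T + 1, C, 2), dtype=int)  # (segment start, previous class)
    for t in range(1, T + 1):
        for d in range(1, min(max_dur, t) + 1):
            s = t - d
            seg = cum[t] - cum[s]              # score of segment [s, t) per class
            if s == 0:
                cand = seg                     # first segment: no transition
                prev = np.zeros(C, dtype=int)
            else:
                prev_scores = V[s][:, None] + trans   # (prev, cur)
                prev = np.argmax(prev_scores, axis=0)
                cand = prev_scores[prev, np.arange(C)] + seg
            better = cand > V[t]
            V[t][better] = cand[better]
            back[t][better] = np.stack([np.full(C, s), prev], axis=1)[better]
    # Backtrace the optimal segmentation.
    labels = np.empty(T, dtype=int)
    t, c = T, int(np.argmax(V[T]))
    while t > 0:
        s, p = back[t, c]
        labels[s:t] = c
        t, c = s, int(p)
    return labels
```

The nested loop over end time and duration gives the classic O(T * D * C^2) cost of naive segmental inference; the speedups the abstract claims come from reducing exactly this kind of computation.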