End-to-end Flow Correlation Tracking with Spatial-temporal Attention
Discriminative correlation filters (DCF) with deep convolutional features have achieved favorable performance on recent tracking benchmarks. However, most existing DCF trackers only consider the appearance features of the current frame and hardly benefit from motion and inter-frame information. The lack of temporal information degrades tracking performance under challenges such as partial occlusion and deformation. In this work, we focus on making use of the rich flow information in consecutive frames to improve the feature representation and the tracking accuracy. The historical feature maps are warped and aggregated with the current ones under the guidance of flow, and an end-to-end training framework is developed for tracking. Specifically, the individual components, including optical flow estimation, feature extraction, aggregation, and correlation filter tracking, are formulated as special layers in the network. The previous frames at predefined intervals are then warped to the current frame using optical flow information. Meanwhile, we propose a novel spatial-temporal attention mechanism to adaptively aggregate the warped feature maps together with the current feature maps. All the modules are trained end-to-end. To the best of our knowledge, this is the first work to jointly train the flow and tracking tasks in a deep learning framework. Extensive experiments are performed on four challenging tracking datasets: OTB2013, OTB2015, VOT2015, and VOT2016, and our method achieves superior results on these benchmarks.
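To make the warping and aggregation steps concrete, below is a minimal PyTorch sketch (not the authors' code) of flow-guided feature warping via bilinear sampling and adaptive aggregation with per-pixel similarity weights. The function names, tensor layouts, and the flow convention (displacements from the current frame to the historical frame) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def warp(feat, flow):
    # feat: (N, C, H, W) historical feature map
    # flow: (N, 2, H, W) per-pixel (x, y) displacements in pixels (assumed convention)
    n, c, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(feat.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                             # (N, 2, H, W)
    # normalize sampling coordinates to [-1, 1] as required by grid_sample
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                          # (N, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

def aggregate(warped_feats, cur_feat):
    # warped_feats: list of (N, C, H, W) features warped to the current frame
    # cur_feat:     (N, C, H, W) features of the current frame
    feats = torch.stack(warped_feats + [cur_feat], dim=0)         # (T, N, C, H, W)
    # per-pixel cosine similarity to the current frame acts as a simple
    # spatial-temporal weight (a stand-in for a learned attention module)
    sims = F.cosine_similarity(feats, cur_feat.unsqueeze(0), dim=2)   # (T, N, H, W)
    weights = torch.softmax(sims, dim=0).unsqueeze(2)                 # (T, N, 1, H, W)
    return (weights * feats).sum(dim=0)                               # (N, C, H, W)
```

In the full method these steps would be preceded by a flow-estimation network and followed by the correlation filter layer, with all components trained jointly; the sketch only illustrates the intermediate warping and weighting.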