FloCoDe: Unbiased Dynamic Scene Graph Generation with Temporal
Consistency and Correlation Debiasing
Dynamic scene graph generation (SGG) from videos requires not only a comprehensive understanding of objects across frames but also a way to capture their temporal motions and interactions. Moreover, the long-tailed distribution of visual relationships is a crucial bottleneck for most dynamic SGG methods: many of them focus on capturing spatio-temporal context with complex architectures and consequently generate biased scene graphs. To address these challenges, we propose \textsc{FloCoDe}: \textbf{Flo}w-aware Temporal Consistency and \textbf{Co}rrelation \textbf{De}biasing with uncertainty attenuation for unbiased dynamic scene graphs. \textsc{FloCoDe} employs flow-based feature warping to detect temporally consistent objects across frames. To address the long-tail issue of visual relationships, we propose correlation debiasing and a label-correlation-based loss to learn unbiased relation representations for long-tailed classes. Specifically, we incorporate label correlations through a contrastive loss that captures commonly co-occurring relations, which aids in learning robust representations for long-tailed classes. Further, we adopt an uncertainty-attenuation-based classifier framework to handle noisy annotations in the SGG data. Extensive experimental evaluation shows performance gains of up to 4.1\%, demonstrating the superiority of our approach in generating more unbiased scene graphs.
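The abstract does not detail the warping step, but "feature warping using flow" conventionally means resampling one frame's feature map along an optical-flow field so that features align with the next frame. A minimal sketch (the function name, the backward-flow convention, and nearest-neighbour sampling are illustrative assumptions, not the paper's implementation, which would typically use differentiable bilinear sampling):

```python
import numpy as np

def warp_features(feat, flow):
    """Warp a feature map from frame t toward frame t+1 along optical flow.

    feat: (H, W, C) features of frame t.
    flow: (H, W, 2) backward flow in (dy, dx) order -- for each location in
          frame t+1, the displacement back to its source location in frame t.
    Returns an (H, W, C) warped map via nearest-neighbour sampling,
    clamped at the image borders.
    """
    H, W, _ = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, W - 1)
    return feat[src_y, src_x]
```

Comparing warped features of frame t against detected features of frame t+1 is one plausible way to score temporal consistency of object proposals across frames.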
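"Uncertainty attenuation" for noisy labels is commonly realized in the style of learned loss attenuation: the model predicts a per-sample log-variance alongside its output, and the loss downweights samples the model flags as uncertain while penalizing the uncertainty itself. A minimal regression-style sketch under that assumption (the paper's classifier head may differ):

```python
import numpy as np

def attenuated_loss(pred, log_var, target):
    """Heteroscedastic loss with learned attenuation.

    Per sample: 0.5 * exp(-log_var) * (pred - target)^2 + 0.5 * log_var.
    High predicted log_var shrinks the squared-error term (tolerating
    noisy annotations) but adds a log-variance penalty so the model
    cannot claim infinite uncertainty everywhere.
    """
    sq_err = (pred - target) ** 2
    return np.mean(0.5 * np.exp(-log_var) * sq_err + 0.5 * log_var)
```

For a fixed large error, raising `log_var` lowers the loss, so gradient descent learns large variance exactly on hard, likely mislabeled samples.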