BCTR: Bidirectional Conditioning Transformer for Scene Graph Generation

Scene Graph Generation (SGG) remains a challenging task due to its compositional nature. Previous approaches improve prediction efficiency through end-to-end learning. However, these methods exhibit limited performance because they assume unidirectional conditioning between entities and predicates, which restricts effective information interaction. To address this limitation, we propose a novel bidirectional conditioning factorization in a semantic-aligned space for SGG, enabling efficient and generalizable interaction between entities and predicates. Specifically, we introduce an end-to-end scene graph generation model, the Bidirectional Conditioning Transformer (BCTR), to implement this factorization. BCTR consists of two key modules. First, the Bidirectional Conditioning Generator (BCG) performs multi-stage interactive feature augmentation between entities and predicates, enabling mutual enhancement between the two sets of predictions. Second, Random Feature Alignment (RFA) is introduced to regularize the feature space by distilling multi-modal knowledge from pre-trained models. Within this regularized feature space, BCG can capture interaction patterns across diverse relationships during training, and the learned interaction patterns generalize to unseen but semantically related relationships during inference. Extensive experiments on Visual Genome and Open Image V6 show that BCTR achieves state-of-the-art performance on both benchmarks.
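To make the bidirectional conditioning idea concrete, the sketch below shows one way such mutual feature augmentation between entity and predicate queries could look: each stage lets entity queries attend to predicate queries and vice versa, and several stages are stacked for multi-stage interaction. This is a minimal illustration under our own assumptions; class names, dimensions, and the use of cross-attention here are hypothetical and do not reproduce the authors' implementation.

import torch
import torch.nn as nn


class BidirectionalConditioningStage(nn.Module):
    """One hypothetical stage of mutual augmentation between entity and predicate queries."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # entity queries conditioned on predicate queries
        self.ent_from_pred = nn.MultiheadAttention(dim, heads, batch_first=True)
        # predicate queries conditioned on entity queries
        self.pred_from_ent = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_ent = nn.LayerNorm(dim)
        self.norm_pred = nn.LayerNorm(dim)

    def forward(self, ent: torch.Tensor, pred: torch.Tensor):
        # ent:  (B, N_e, dim) entity query features
        # pred: (B, N_p, dim) predicate query features
        ent_aug, _ = self.ent_from_pred(ent, pred, pred)
        pred_aug, _ = self.pred_from_ent(pred, ent, ent)
        ent = self.norm_ent(ent + ent_aug)
        pred = self.norm_pred(pred + pred_aug)
        return ent, pred


class BidirectionalConditioningGenerator(nn.Module):
    """Stacks several conditioning stages to model multi-stage interaction."""

    def __init__(self, dim: int = 256, heads: int = 8, num_stages: int = 3):
        super().__init__()
        self.stages = nn.ModuleList(
            BidirectionalConditioningStage(dim, heads) for _ in range(num_stages)
        )

    def forward(self, ent: torch.Tensor, pred: torch.Tensor):
        for stage in self.stages:
            ent, pred = stage(ent, pred)
        return ent, pred


if __name__ == "__main__":
    B, N_e, N_p, dim = 2, 100, 300, 256
    bcg = BidirectionalConditioningGenerator(dim)
    ent, pred = bcg(torch.randn(B, N_e, dim), torch.randn(B, N_p, dim))
    print(ent.shape, pred.shape)  # torch.Size([2, 100, 256]) torch.Size([2, 300, 256])

In this reading, the bidirectional cross-attention replaces the one-way conditioning of prior pipelines, so entity and predicate predictions can refine each other at every stage rather than in a fixed order.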
@article{hao2025_2407.18715,
  title   = {BCTR: Bidirectional Conditioning Transformer for Scene Graph Generation},
  author  = {Peng Hao and Weilong Wang and Xiaobing Wang and Yingying Jiang and Hanchao Jia and Shaowei Cui and Junhang Wei and Xiaoshuai Hao},
  journal = {arXiv preprint arXiv:2407.18715},
  year    = {2025}
}