Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding

1 December 2024

Abstract

Visual grounding aims to localize image regions based on a textual query. Because large-scale data curation is difficult, we investigate how to learn visual grounding effectively in data-scarce settings. To address data scarcity, we propose a novel framework, POBF (Paint Outside the Box and Filter). POBF synthesizes images by inpainting outside the box, which tackles a label misalignment issue encountered in previous works. Furthermore, POBF uses a filtering scheme to select the most effective training data. This scheme combines a hardness score and an overfitting score, balanced by a penalty term. Extensive experiments across four benchmark datasets demonstrate that POBF consistently improves performance, achieving an average gain of 5.83% over the real-data-only method and outperforming leading baselines by 2.29% to 3.85% in accuracy. Additionally, we validate the robustness and generalizability of POBF across various generative models, training data sizes, and model architectures.
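To make the idea of inpainting outside the box concrete, the following is a minimal sketch of how such an inpainting mask could be built. It is not the authors' implementation: the function name `outside_box_mask`, the `(x1, y1, x2, y2)` pixel box format with exclusive upper bounds, and the convention that 255 marks pixels to repaint are all assumptions made for illustration. The sketch only shows why the label stays aligned: the pixels inside the annotated box are never touched, so the original box still describes the object.

```python
import numpy as np


def outside_box_mask(height, width, box):
    """Build a mask that lets an inpainting model repaint only outside the box.

    Hypothetical helper for illustration. `box` is assumed to be
    (x1, y1, x2, y2) in pixels, with x2 and y2 exclusive.
    Returns a uint8 array: 255 = may be repainted, 0 = keep as-is.
    """
    x1, y1, x2, y2 = box
    # Start with everything marked as repaintable.
    mask = np.full((height, width), 255, dtype=np.uint8)
    # Protect the annotated region so the box label remains valid.
    mask[y1:y2, x1:x2] = 0
    return mask


# Tiny example: a 4x6 image with a 3x2 box protected in the middle.
mask = outside_box_mask(4, 6, (1, 1, 4, 3))
```

A mask of this kind could be passed to any inpainting model that accepts a binary mask. The filtering step described in the abstract is not sketched here, because the abstract does not define the hardness score, the overfitting score, or the penalty term.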


@article{du2025_2412.00684,
  title={Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding},
  author={Zilin Du and Haoxin Li and Jianfei Yu and Boyang Li},
  journal={arXiv preprint arXiv:2412.00684},
  year={2025}
}
