Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development

Abstract

The emergence of multimodal large models has advanced artificial intelligence, introducing unprecedented levels of performance and functionality. However, optimizing these models remains challenging because model-centric and data-centric development have historically followed isolated paths, leading to suboptimal outcomes and inefficient resource utilization. In response, we present a new sandbox suite tailored for integrated data-model co-development. This sandbox provides a feedback-driven experimental platform that enables cost-effective iteration and guided refinement of both data and models. Our proposed "Probe-Analyze-Refine" workflow, validated through practical use cases on multimodal tasks such as image-text pre-training with CLIP, image-to-text generation with LLaVA-like models, and text-to-video generation with DiT-based models, yields transferable and notable performance boosts, such as topping the VBench leaderboard. Extensive experiments also uncover fruitful insights into the interplay among data quality, diversity, model behavior, and computational cost. All code, datasets, and models are open-sourced to foster future research and applications that would otherwise be infeasible due to the lack of a dedicated co-development infrastructure.
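
To make the shape of the workflow concrete, below is a minimal, self-contained Python sketch of a "Probe-Analyze-Refine" feedback loop over a data pool. Every name in it (probe_quality, refine_pool, train_and_eval) is a hypothetical placeholder for illustration, not the Data-Juicer API; in the actual suite, probing applies data-processing operators and the feedback signal comes from training and evaluating a model on the refined data.

    import random

    def probe_quality(sample):
        # Hypothetical probe: assign each sample a scalar score. A real
        # probe would apply quality/diversity operators to the sample
        # rather than random scoring.
        return random.random()

    def refine_pool(pool, scores, keep_ratio=0.5):
        # Refine: keep only the top-scoring fraction of the data pool.
        ranked = sorted(zip(scores, pool), reverse=True)
        k = max(1, int(len(ranked) * keep_ratio))
        return [sample for _, sample in ranked[:k]]

    def train_and_eval(pool):
        # Hypothetical feedback signal: train a small reference model on
        # the refined pool and return a benchmark score. Stubbed here as
        # the mean probe score of the pool.
        return sum(probe_quality(s) for s in pool) / len(pool)

    pool = [f"sample-{i}" for i in range(1000)]
    for iteration in range(3):  # a few cheap feedback-driven iterations
        scores = [probe_quality(s) for s in pool]   # Probe
        pool = refine_pool(pool, scores)            # Analyze + Refine
        metric = train_and_eval(pool)               # model-side feedback
        print(f"iteration {iteration}: pool size={len(pool)}, metric={metric:.3f}")

In the paper's experiments the pools are multimodal datasets and the feedback comes from trained CLIP-, LLaVA-, or DiT-style models, but the control flow follows this same probe-refine-evaluate cycle.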

Citation

@article{chen2025_2407.11784,
  title={Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development},
  author={Daoyuan Chen and Haibin Wang and Yilun Huang and Ce Ge and Yaliang Li and Bolin Ding and Jingren Zhou},
  journal={arXiv preprint arXiv:2407.11784},
  year={2025}
}