DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

Hao Liang
Xiaochen Ma
Zhou Liu
Zhen Hao Wong
Zhengyang Zhao
Zimo Meng
Runming He
Chengyu Shen
Qifeng Cai
Zhaoyang Han
Meiyi Qiang
Yalin Feng
Tianyi Bai
Zewei Pan
Ziyi Guo
Yizhen Jiang
Jingwen Deng
Qijie You
Peichao Lai
Tianyu Guo
Chi Hsu Tsai
Hengyi Feng
Rui Hu
Wenkai Yu
Junbo Niu
Bohan Zeng
Ruichuan An
Lu Ma
Jihao Huang
Yaowei Zheng
Conghui He
Linpeng Tang
Bin Cui
Weinan E
Wentao Zhang
Main: 29 Pages
8 Figures
Bibliography: 5 Pages
14 Tables
Appendix: 2 Pages
Abstract

The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3% execution accuracy in Text-to-SQL over SynSQL, +7% average improvements on code benchmarks, and 1–3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.
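
To make the phrase "PyTorch-style pipeline construction API" concrete, the following is a minimal sketch of what such a style could look like: operators are declared in `__init__` and composed in `forward`, mirroring the `torch.nn.Module` convention. It is not DataFlow's actual API. Every name here (`Operator`, `Pipeline`, `MinLengthFilter`, `GenerateField`, `QAPipeline`, `fake_llm`) is hypothetical and chosen only for illustration, and the LLM call is replaced by a plain Python function.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


class Operator:
    """Hypothetical base class: one reusable transformation over records."""

    def run(self, records: List[Record]) -> List[Record]:
        raise NotImplementedError

    def __call__(self, records: List[Record]) -> List[Record]:
        return self.run(records)


class MinLengthFilter(Operator):
    """Keeps records whose `field` has at least `min_chars` characters."""

    def __init__(self, field: str, min_chars: int) -> None:
        self.field = field
        self.min_chars = min_chars

    def run(self, records: List[Record]) -> List[Record]:
        return [
            r for r in records
            if len(str(r.get(self.field, ""))) >= self.min_chars
        ]


class GenerateField(Operator):
    """Adds `target` by applying `generate` to `source`.

    `generate` stands in for a model call (assumption: any str -> str function).
    """

    def __init__(self, source: str, target: str,
                 generate: Callable[[str], str]) -> None:
        self.source = source
        self.target = target
        self.generate = generate

    def run(self, records: List[Record]) -> List[Record]:
        return [
            {**r, self.target: self.generate(str(r[self.source]))}
            for r in records
        ]


class Pipeline(Operator):
    """Hypothetical container: subclasses compose operators in `forward`."""

    def forward(self, records: List[Record]) -> List[Record]:
        raise NotImplementedError

    def run(self, records: List[Record]) -> List[Record]:
        return self.forward(records)


class QAPipeline(Pipeline):
    """Filter short texts, then synthesize a question for each survivor."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self.keep_long = MinLengthFilter("text", min_chars=20)
        self.add_question = GenerateField("text", "question", generate)

    def forward(self, records: List[Record]) -> List[Record]:
        records = self.keep_long(records)
        return self.add_question(records)


def fake_llm(text: str) -> str:
    """Deterministic placeholder for a model-in-the-loop generation step."""
    return f"What is the main point of: {text[:30]}?"


def demo() -> List[Record]:
    data = [
        {"text": "too short"},
        {"text": "DataFlow composes reusable operators into pipelines."},
    ]
    return QAPipeline(fake_llm)(data)
```

Because each step in this sketch is an ordinary object invoked inside `forward`, intermediate results can be inspected or a single operator swapped without touching the rest of the pipeline, which is one plausible reading of the "debuggable" and "composable" properties the abstract describes.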
