Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training

Abstract

Domain-specific accelerators (DSAs) are integrated in datacenter-scale clusters across industry to train increasingly complex deep learning models over massive datasets. As innovations in DSAs continue to increase training efficiency and throughput, the data storage and ingestion (DSI) pipeline, the systems and hardware responsible for storing and preprocessing training data, will dominate and constrain training capacity. Similar innovation in DSI is urgent, demanding an in-depth understanding of DSI systems, infrastructure, and characteristics. To this end, this paper presents Meta's end-to-end DSI pipeline, composed of a central data warehouse built on distributed storage and a Data PreProcessing Service (DPP) that scales to eliminate data stalls. We characterize how hundreds of models are collaboratively trained across our global fleet, how massive and evolving datasets are stored and read, and how online preprocessing places intense demands on our underlying hardware. We synthesize key takeaways from our characterization and close with a discussion of lessons learned and research opportunities for both industry and academia.
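The abstract's core idea, scaling online preprocessing so that accelerators never wait on data, can be illustrated with a minimal producer-consumer sketch. This is not the paper's DPP implementation; it is a hypothetical illustration in which worker threads preprocess raw records into a bounded queue that a trainer loop drains, so adding workers hides preprocessing latency:

```python
import queue
import threading

def preprocess(raw_record):
    # Stand-in for real feature extraction/transformation work.
    return [x * 2 for x in raw_record]

def _worker(raw_records, out_q):
    # Producer: transforms its shard of records and enqueues results.
    for rec in raw_records:
        out_q.put(preprocess(rec))

def run_pipeline(raw_records, num_workers=4):
    # Bounded buffer absorbs rate mismatch between preprocessing
    # workers and the consuming trainer.
    out_q = queue.Queue(maxsize=64)
    shards = [raw_records[i::num_workers] for i in range(num_workers)]
    threads = [threading.Thread(target=_worker, args=(s, out_q))
               for s in shards]
    for t in threads:
        t.start()
    # Consumer: the "trainer" avoids data stalls as long as the
    # aggregate worker throughput matches its consumption rate.
    batches = [out_q.get() for _ in range(len(raw_records))]
    for t in threads:
        t.join()
    return batches
```

In the paper's setting the workers are a disaggregated service rather than local threads, so preprocessing capacity can be scaled independently of the trainer hosts.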
