PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management

IEEE Transactions on Parallel and Distributed Systems (TPDS), 2021

12 August 2021

Abstract

Pre-trained models (PTMs) are revolutionizing artificial intelligence (AI) technology. A PTM learns general language features from massive data and is then fine-tuned on task-specific data. Unfortunately, the hardware required for PTM training is prohibitively expensive, which restricts it to a small proportion of the AI community. We therefore propose PatrickStar, a system that lowers the hardware requirements of PTMs and makes them accessible to everyone. PatrickStar stores the model data in the heterogeneous CPU-GPU memory space. Unlike existing works, we are the first to manage the model data in a fine-grained manner, organizing it into memory chunks and dynamically distributing them across the heterogeneous memory space. Guided by runtime memory statistics collected in a warm-up iteration, chunks are orchestrated efficiently in heterogeneous memory, which lowers the CPU-GPU data transmission volume. Working in symbiosis with the Zero Redundancy Optimizer, PatrickStar scales to multiple GPUs using data parallelism, with lower communication bandwidth requirements and more efficient bandwidth utilization. The system can train bigger models with larger batch sizes than existing works can handle. Experimental results show that PatrickStar trains a 12-billion-parameter GPT model, 1.5x as large as the model scale limit of state-of-the-art (SOTA) works, on a node with 8 V100 GPUs and 240 GB of CPU memory, and also achieves significantly higher computing efficiency than SOTA. Even on a $700 personal computer, it can train a 0.7-billion-parameter GPT model. Our code is publicly available.
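The chunk idea described in the abstract can be illustrated with a toy sketch. This is not PatrickStar's code or API: the names `Chunk`, `pack_into_chunks`, and `place_chunks` are hypothetical, sizes are counted in abstract elements, and the placement rule is a simple static GPU budget assumed for illustration, whereas the paper places chunks using runtime memory statistics from a warm-up iteration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical fixed-capacity memory block holding several tensors."""
    capacity: int                                  # capacity in elements
    tensors: list = field(default_factory=list)    # (name, numel) pairs
    used: int = 0
    device: str = "cpu"

    def try_add(self, name, numel):
        """Append a tensor if it fits; report whether it was added."""
        if self.used + numel > self.capacity:
            return False
        self.tensors.append((name, numel))
        self.used += numel
        return True


def pack_into_chunks(tensor_sizes, chunk_capacity):
    """Pack tensors in order, opening a new chunk when the last one is full."""
    chunks = []
    for name, numel in tensor_sizes:
        if numel > chunk_capacity:
            raise ValueError(f"{name} does not fit in a single chunk")
        if not chunks or not chunks[-1].try_add(name, numel):
            chunk = Chunk(chunk_capacity)
            chunk.try_add(name, numel)
            chunks.append(chunk)
    return chunks


def place_chunks(chunks, gpu_budget):
    """Assumed policy: keep chunks on the GPU while they fit, else use the CPU."""
    remaining = gpu_budget
    for chunk in chunks:
        if chunk.capacity <= remaining:
            chunk.device = "gpu"
            remaining -= chunk.capacity
        else:
            chunk.device = "cpu"
    return chunks


# Made-up tensor sizes, chosen only to exercise the packing logic.
sizes = [("embed", 6), ("attn.qkv", 3), ("attn.out", 1),
         ("mlp.fc1", 4), ("mlp.fc2", 4)]
chunks = place_chunks(pack_into_chunks(sizes, chunk_capacity=8), gpu_budget=16)
for i, c in enumerate(chunks):
    print(i, c.device, c.tensors)
```

Moving whole chunks rather than individual tensors means each CPU-GPU transfer is one large contiguous copy, which is the intuition behind the lower transmission volume and better bandwidth utilization claimed in the abstract.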
