MLKV: Efficiently Scaling up Large Embedding Model Training with Disk-based Key-Value Storage

Yongjun He
Roger Waleffe
Zhichao Han
Johnu George
Binhang Yuan
Zitao Zhang
Yinan Shan
Yang Zhao
Debojyoti Dutta
Theodoros Rekatsinas
Ce Zhang
Abstract

Many modern machine learning (ML) methods rely on embedding models to learn vector representations (embeddings) for a set of entities (embedding tables). As increasingly diverse ML applications utilize embedding models and embedding tables continue to grow in size and number, there has been a surge in the ad-hoc development of specialized frameworks targeted at training large embedding models for specific tasks. Although the scalability issues that arise in different embedding model training tasks are similar, each of these frameworks independently reinvents and customizes storage components for specific tasks, leading to substantial duplicated engineering effort in both development and deployment. This paper presents MLKV, an efficient, extensible, and reusable data storage framework designed to address the scalability challenges in embedding model training, specifically data stalls and staleness. MLKV augments disk-based key-value storage by democratizing optimizations that were previously exclusive to individual specialized frameworks and provides easy-to-use interfaces for embedding model training tasks. Extensive experiments on open-source workloads, as well as applications in eBay's payment transaction risk detection and seller payment risk detection, show that MLKV outperforms offloading strategies built on top of industrial-strength key-value stores by 1.6-12.6x. MLKV is open-source at this https URL.

@article{he2025_2504.01506,
  title={MLKV: Efficiently Scaling up Large Embedding Model Training with Disk-based Key-Value Storage},
  author={Yongjun He and Roger Waleffe and Zhichao Han and Johnu George and Binhang Yuan and Zitao Zhang and Yinan Shan and Yang Zhao and Debojyoti Dutta and Theodoros Rekatsinas and Ce Zhang},
  journal={arXiv preprint arXiv:2504.01506},
  year={2025}
}