Croissant: A Metadata Format for ML-Ready Datasets
Mubashara Akhtar
Omar Benjelloun
Costanza Conforti
Pieter Gijsbers
Joan Giner-Miguelez
Nitisha Jain
Michael Kuchnik
Quentin Lhoest
Pierre Marcenac
M. Maskey
Peter Mattson
Luis Oala
Pierre Ruyssen
Rajat Shinde
Elena Simperl
Goeffry Thomas
Slava Tykhonov
Joaquin Vanschoren
Jos van der Velde
Steffen Vogler
Carole-Jean Wu

Abstract
Data is a critical resource for Machine Learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that simplifies how data is used by ML tools and frameworks. Croissant makes datasets more discoverable, portable and interoperable, thereby addressing significant challenges in ML data management and responsible AI. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, ready to be loaded into the most popular ML frameworks.
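To make the idea of a dataset metadata format concrete, the following is a minimal sketch of what a Croissant-style description might look like and how a tool could read it. It is an illustration under assumptions, not a complete or validated Croissant file: the dataset name, URL, and field names are hypothetical, the JSON-LD context is heavily simplified, and the helper `summarize_record_sets` is introduced only for this example.

```python
import json

# Hypothetical, simplified Croissant-style metadata (JSON-LD expressed as a dict).
# A real Croissant file carries a fuller @context and more required properties.
metadata = {
    "@context": {"sc": "https://schema.org/", "cr": "http://mlcommons.org/croissant/"},
    "@type": "sc:Dataset",
    "name": "toy_reviews",  # hypothetical dataset name
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "contentUrl": "https://example.org/reviews.csv",  # placeholder URL
            "encodingFormat": "text/csv",
        }
    ],
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "field": [
                {"@type": "cr:Field", "@id": "reviews/text", "dataType": "sc:Text"},
                {"@type": "cr:Field", "@id": "reviews/label", "dataType": "sc:Integer"},
            ],
        }
    ],
}


def summarize_record_sets(meta):
    """Map each record set id to the list of field ids it declares."""
    return {
        record_set["@id"]: [field["@id"] for field in record_set.get("field", [])]
        for record_set in meta.get("recordSet", [])
    }


# The metadata is plain JSON, so it can be serialized and shared alongside the data.
serialized = json.dumps(metadata, indent=2)
summary = summarize_record_sets(json.loads(serialized))
print(summary)
```

Because the description is machine-readable, a loader can discover which files to fetch and which fields each record exposes without dataset-specific code, which is the kind of portability the abstract refers to.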