Proper Dataset Valuation by Pointwise Mutual Information
Data plays a central role in advances in modern artificial intelligence, with high-quality data emerging as a key driver of model performance. This has prompted the development of principled and effective data curation methods in recent years. However, existing methods largely rely on heuristics, and whether they are truly effective remains unclear. For instance, standard evaluation methods that assess a trained model's performance on specific benchmarks may incentivize assigning high scores to data that merely resembles the test set. This issue exemplifies Goodhart's law: when a measure becomes a target, it ceases to be a good measure. To address this issue, we propose an information-theoretic framework for evaluating data curation methods. We define dataset quality in terms of a dataset's informativeness about the true model parameters, formalized using the Blackwell ordering of informativeness. Under this ordering, Blackwell's theorem guarantees that more informative data yields optimal models with lower expected loss on the true underlying distribution. To measure informativeness, we show that the Blackwell order can be determined by the Shannon mutual information between the curated data and the test data. To estimate this mutual information, we introduce a novel method that trains Bayesian models on embedded datasets and computes mutual information from the posteriors of model parameters. Experiments on real-world data demonstrate that our mutual information-based evaluation assigns appropriately lower scores to data curation strategies that reduce dataset informativeness, whereas traditional test score-based evaluation may favor strategies that overfit to the test set while compromising the training data's informativeness.
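The paper's actual estimator is not reproduced here, but the quantity it targets can be illustrated with a toy example. For a conjugate Gaussian model, the mutual information between the parameter and a dataset, I(theta; D) = H(theta) − E[H(theta | D)], is available in closed form from the prior and posterior entropies, and grows with dataset informativeness. The model and all names below are illustrative assumptions, not the paper's implementation:

```python
import math

def gaussian_entropy(var):
    """Differential entropy of a 1-D Gaussian with the given variance (in nats)."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

def mutual_information(prior_var, noise_var, n):
    """I(theta; D) = H(theta) - E[H(theta | D)] for a conjugate Gaussian model.

    Model (an illustrative assumption): theta ~ N(0, prior_var), and D consists
    of n observations theta + N(0, noise_var). The posterior variance depends
    only on n, not on the observed values, so the expected posterior entropy
    equals the posterior entropy of any size-n dataset.
    """
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    return gaussian_entropy(prior_var) - gaussian_entropy(post_var)

# Larger (more informative) datasets yield higher mutual information with the
# parameters, mirroring how the framework ranks curation strategies.
for n in (1, 10, 100):
    print(f"n={n:4d}  I(theta; D) = {mutual_information(1.0, 1.0, n):.3f} nats")
```

In this conjugate case a curation step that discards observations strictly shrinks I(theta; D), whereas a test-score metric could still reward it if the remaining points happen to resemble the benchmark.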