Data plays a central role in the development of modern artificial intelligence, with high-quality data emerging as a key driver of model performance. This has prompted the development of a variety of data curation methods in recent years. However, measuring the effectiveness of these data curation techniques remains a major challenge. Traditional evaluation methods, which assess a trained model's performance on specific benchmarks, risk promoting practices that merely make the data more similar to the test data. This issue exemplifies Goodhart's law: when a measure becomes a target, it ceases to be a good measure. To address this, we propose an information-theoretic framework for evaluating data curation methods, in which dataset quality is measured by its informativeness about the true model parameters, in the sense of the Blackwell ordering. We compare informativeness via the Shannon mutual information between the evaluated data and the test data, and we propose a novel method for estimating this mutual information by training Bayesian models on embedded data and computing the mutual information from the models' parameter posteriors. Experiments on real-world data demonstrate that our mutual-information-based evaluation assigns appropriately lower scores to data curation strategies that reduce dataset informativeness, whereas traditional test-score-based evaluation may favor strategies that overfit to the test set while compromising the training data's informativeness.
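To make the estimation idea concrete, here is a minimal sketch under simplifying assumptions that go beyond the abstract: a linear-Gaussian Bayesian model over fixed embeddings, where the parameter posterior is Gaussian and the mutual information between the parameters and a dataset's labels has a closed form. The function name, prior and noise variances, and toy data are all illustrative, not the paper's actual estimator.

```python
import numpy as np

def dataset_mutual_information(X, prior_var=1.0, noise_var=0.1):
    """Mutual information I(theta; Y | X), in nats, between model parameters
    and the labels of a dataset with embedded features X, under the Bayesian
    linear model  y = X theta + eps,  theta ~ N(0, prior_var * I),
    eps ~ N(0, noise_var * I).

    In this linear-Gaussian case the posterior covariance does not depend on
    the observed labels, so the MI reduces to the prior-minus-posterior
    entropy gap:  0.5 * log det(I + (prior_var / noise_var) * X^T X).
    """
    d = X.shape[1]
    gram = X.T @ X
    # slogdet for numerical stability instead of det + log
    _, logdet = np.linalg.slogdet(np.eye(d) + (prior_var / noise_var) * gram)
    return 0.5 * logdet

# Toy comparison: a curation step that simply discards half the rows can
# only reduce the dataset's informativeness about the parameters.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 16))   # stand-in for embedded training data
X_curated = X_full[:100]              # "curation" that drops 100 examples
print(dataset_mutual_information(X_full))     # larger MI
print(dataset_mutual_information(X_curated))  # smaller MI
```

Removing rows can only shrink X^T X in the positive-semidefinite order, so the curated score is never larger here, which mirrors the abstract's claim that an MI-based evaluation penalizes curation strategies that throw away information, regardless of how they affect a benchmark score.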