Researchers often face choices between multiple data sources that differ in quality, cost, and representativeness. Which sources will most improve predictive performance? We study this data prioritization problem under a random distribution shift model, in which candidate sources arise from random perturbations to a target population. We propose the Data Usefulness Coefficient (DUC), which predicts the reduction in prediction error from adding a dataset to training, using only covariate summary statistics and no outcome data. We prove that under random shifts, covariate differences between sources are informative about outcome prediction quality. Through theory and experiments on synthetic and real data, we demonstrate that DUC-based selection outperforms alternative strategies, enabling more efficient resource allocation across heterogeneous data sources. The method provides interpretable rankings of candidate datasets and works for any data modality, including ordinal, categorical, and continuous data.
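To make the selection setting concrete, here is a minimal sketch of ranking candidate datasets using only covariate summary statistics, with no outcome data. The `usefulness_score` function and its mean/covariance distance are hypothetical illustrations, not the paper's DUC formula:

```python
import numpy as np

def usefulness_score(target_X, candidate_X):
    """Hypothetical usefulness proxy (NOT the paper's DUC formula):
    candidates whose covariate summary statistics (mean and covariance)
    are closer to the target's receive a higher score. Illustrates that
    ranking can be done without any outcome data."""
    mu_t, mu_c = target_X.mean(axis=0), candidate_X.mean(axis=0)
    cov_t = np.cov(target_X, rowvar=False)
    cov_c = np.cov(candidate_X, rowvar=False)
    # Frobenius-type distance between summary statistics
    dist = np.linalg.norm(mu_t - mu_c) + np.linalg.norm(cov_t - cov_c)
    return 1.0 / (1.0 + dist)

rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=(500, 3))
near = rng.normal(0.1, 1.0, size=(500, 3))   # mild covariate shift
far = rng.normal(2.0, 1.5, size=(500, 3))    # severe covariate shift

scores = {"near": usefulness_score(target, near),
          "far": usefulness_score(target, far)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # the mildly shifted source ranks first
```

The point of the sketch is the interface, not the formula: each candidate is scored from covariate summaries alone, and candidates are then prioritized by score.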