Data heterogeneity is an intrinsic and fundamental property of big data, and it arises in a variety of real-world applications such as precision medicine, autonomous driving, and finance. For machine learning algorithms, ignoring data heterogeneity greatly hurts both generalization performance and algorithmic fairness, since the prediction mechanisms of different sub-populations are likely to differ from each other. In this work, we focus on the data heterogeneity that affects the prediction of machine learning models, and are the first to propose the \emph{usable predictive heterogeneity}, which takes into account model capacity and computational constraints. We prove that it can be reliably estimated from finite data with probably approximately correct (PAC) bounds. Additionally, we design a bi-level optimization algorithm to explore the usable predictive heterogeneity from data. Empirically, the explored heterogeneity provides insights into sub-population divisions in income prediction, crop yield prediction, and image classification tasks, and leveraging such heterogeneity improves out-of-distribution generalization performance.