Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance. In this work, we seek to better understand how we might characterize, detect, and design for data-model synergies. We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance on key sub-groups of the population, a phenomenon we refer to as negative data externalities on group performance. Such externalities can arise in standard learning settings and can manifest differently depending on the relationship between training set size and model size. Data externalities directly imply a lower bound on feasible model improvements, yet improving models efficiently requires understanding the underlying data-model tensions. More broadly, our results indicate that data efficiency is a key component of both accurate and trustworthy machine learning.
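The core phenomenon can be illustrated with a deliberately simple toy construction (not the paper's experimental setup): a logistic-regression learner is fit on data from a target group A, then refit after pooling in a larger extra source B whose feature-label relationship is shifted relative to A. Under this assumed concept shift, the added data pulls the decision boundary away from what works for group A, so accuracy measured on A drops even though the training set grew. All names and the data-generating process below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logreg(X, y, lr=0.5, steps=500):
    # Plain gradient descent on the logistic loss (no intercept,
    # no regularization) -- enough for this 1-D illustration.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    return float(((X @ w > 0).astype(int) == y).mean())

# Group A: the label is 1 exactly when the feature is positive.
X_a = rng.normal(size=(200, 1))
y_a = (X_a[:, 0] > 0).astype(int)

# Extra source B (3x larger): the same feature predicts the
# *opposite* label -- an assumed concept shift between sources.
X_b = rng.normal(size=(600, 1))
y_b = (X_b[:, 0] < 0).astype(int)

w_a = fit_logreg(X_a, y_a)                                     # trained on A only
w_ab = fit_logreg(np.vstack([X_a, X_b]),
                  np.concatenate([y_a, y_b]))                  # trained on A + B

acc_before = accuracy(w_a, X_a, y_a)
acc_after = accuracy(w_ab, X_a, y_a)
print(f"accuracy on group A: {acc_before:.2f} -> {acc_after:.2f}")
```

Here more data strictly hurts the evaluated group: since B outweighs A, the fitted weight flips sign and group A's accuracy falls from near-perfect to near-zero. Real externalities are subtler than this hand-built shift, but the mechanism sketched (source heterogeneity interacting with a shared model) is the same.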