Safely deploying machine learning models to the real world is often a challenging process. Models trained on data from a specific geographic location tend to fail when queried with data obtained elsewhere, agents trained in simulation can struggle to adapt when deployed in the real world or in novel environments, and neural networks fit to a subset of the population may carry selection bias into their decision process. In this work, we describe the problem of data shift from a novel information-theoretic perspective by (i) identifying and describing the different sources of error, and (ii) comparing some of the most promising objectives explored in the recent domain generalization and fair classification literature. From our theoretical analysis and empirical evaluation, we conclude that the model selection procedure needs to be guided by careful consideration of the observed data, the factors used for correction, and the structure of the data-generating process.