When a pre-trained machine learning model is deployed, a known issue is that the target population may not be reflected in the source population on which the model was trained. The deployed model can then be biased, and its performance can degrade. One risk is that, as the population changes, certain demographic groups will be under-served or otherwise disadvantaged by the model, even as they become more represented in the target population. The field of domain adaptation proposes techniques for settings in which labeled data for the target population does not exist but some information about the target distribution does. In this paper we contribute to the domain adaptation literature by introducing domain-adaptive decision trees (DADT). We focus on decision trees given their growing popularity, which stems from their interpretability and their performance relative to more complex models. With DADT we aim to improve the accuracy of models trained in a source domain (the training data) that differs from the target domain (the test data). We propose an in-processing step that adjusts the information gain split criterion with outside information about the distribution of the target population. We demonstrate DADT on real data and find that it improves accuracy over a standard decision tree when testing on a shifted target population. We also study the change in fairness under demographic parity and equal opportunity; the results show an improvement in fairness with the use of DADT.
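The idea of adjusting the information gain criterion with target-distribution information can be illustrated with a minimal sketch. The abstract does not give the exact DADT formulation, so the code below is an assumption: it substitutes plain importance weighting, where each training row is weighted by the ratio of its group's share in the target population to its share in the source sample. The function names (`target_weights`, `weighted_entropy`, `adjusted_information_gain`) and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def target_weights(groups, target_dist):
    """Weight each row by (target share of its group) / (source share of its group).

    `target_dist` is the assumed outside information: a dict mapping each
    group value to its proportion in the target population.
    """
    counts = Counter(groups)
    n = len(groups)
    return [target_dist.get(g, 0.0) / (counts[g] / n) for g in groups]


def weighted_entropy(labels, weights):
    """Shannon entropy (base 2) of the labels, using row weights as mass."""
    total = sum(weights)
    if total == 0:
        return 0.0
    mass = defaultdict(float)
    for y, w in zip(labels, weights):
        mass[y] += w
    return -sum((m / total) * math.log2(m / total) for m in mass.values() if m > 0)


def adjusted_information_gain(feature, labels, weights):
    """Information gain of splitting on a categorical feature under row weights."""
    total = sum(weights)
    parent = weighted_entropy(labels, weights)
    children = 0.0
    for value in set(feature):
        idx = [i for i, v in enumerate(feature) if v == value]
        w = [weights[i] for i in idx]
        y = [labels[i] for i in idx]
        children += (sum(w) / total) * weighted_entropy(y, w)
    return parent - children


# Toy source sample: group "b" is under-represented relative to the target.
groups = ["a", "a", "a", "b"]
labels = [1, 1, 1, 0]
feature = [0, 0, 0, 1]

source_gain = adjusted_information_gain(feature, labels, [1.0] * len(labels))
weights = target_weights(groups, {"a": 0.5, "b": 0.5})
target_gain = adjusted_information_gain(feature, labels, weights)
```

In this toy case the same split scores about 0.811 bits under the source distribution and 1.0 bit once rows are reweighted toward an evenly split target, so a tree grown with the adjusted criterion would rank candidate splits differently than a standard tree. Whether DADT uses this weighting or another estimator of the target distribution is not stated in the abstract.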