To act safely and ethically in the real world, agents must be able to reason about harm and avoid harmful actions. In this paper we develop the first statistical definition of harm and a framework for incorporating harm into algorithmic decisions. We argue that harm is fundamentally a counterfactual quantity, and show that standard machine learning algorithms that cannot perform counterfactual reasoning are guaranteed to pursue harmful policies in certain environments. To resolve this we derive a family of counterfactual objective functions that robustly mitigate harm. We demonstrate our approach with a statistical model for identifying optimal drug doses. While standard algorithms that select doses using causal treatment effects result in harmful doses, our counterfactual algorithm identifies doses that are significantly less harmful without sacrificing efficacy.
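To make the contrast between the two objectives concrete, the following Python sketch simulates a hypothetical heterogeneous patient population and compares a dose chosen by maximising the causal treatment effect with one chosen by a counterfactual, harm-penalised objective. The population mixture, the dose-response curves, the monotone potential-outcome coupling, and the harm-aversion weight `lam` are illustrative assumptions, not the paper's fitted model or its exact objective.

```python
import numpy as np

# Minimal illustrative sketch, not the paper's model: the population mixture,
# dose-response curves, and harm-aversion weight below are all assumptions.

rng = np.random.default_rng(0)
n = 200_000

# Latent patient type: 'sensitive' patients are harmed by higher doses,
# while the remaining patients benefit from them.
sensitive = rng.random(n) < 0.3


def recovery_prob(dose):
    """Assumed structural model for P(recovery | dose, patient type)."""
    base = np.where(sensitive, 0.7, 0.2)                  # recovery under the default (no dose)
    effect = np.where(sensitive, -0.6 * dose**2,          # harm grows with dose
                      0.7 * (1.0 - np.exp(-3.0 * dose)))  # benefit saturates with dose
    return np.clip(base + effect, 0.0, 1.0)


# Shared latent noise couples factual and counterfactual outcomes
# (a simple monotone potential-outcome construction).
u = rng.random(n)


def outcome(dose):
    return (u < recovery_prob(dose)).astype(float)


y_default = outcome(0.0)          # outcome had the default action been taken
doses = np.linspace(0.0, 1.0, 21)
lam = 2.0                         # harm-aversion weight (assumed hyperparameter)


def treatment_effect(d):
    return outcome(d).mean() - y_default.mean()


def counterfactual_harm(d):
    # Fraction of patients who would have recovered under the default
    # action but do not recover under dose d.
    y = outcome(d)
    return ((y_default == 1) & (y == 0)).mean()


d_standard = max(doses, key=treatment_effect)
d_harm_averse = max(doses, key=lambda d: treatment_effect(d) - lam * counterfactual_harm(d))

print(f"standard (treatment-effect) dose: {d_standard:.2f}, "
      f"harm = {counterfactual_harm(d_standard):.3f}")
print(f"harm-averse counterfactual dose:  {d_harm_averse:.2f}, "
      f"harm = {counterfactual_harm(d_harm_averse):.3f}")
```

In this toy setting both objectives yield a substantial expected benefit, but the harm-penalised objective selects a noticeably lower dose with a much smaller fraction of counterfactually harmed patients, illustrating the trade-off the abstract describes.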