To act safely and ethically in the real world, agents must be able to reason about harm and avoid harmful actions. In this paper we develop the first statistical definition of harm and a framework for factoring harm into algorithmic decisions. We argue that harm is fundamentally a counterfactual quantity, and show that standard machine learning algorithms are guaranteed to pursue harmful policies in certain environments. To resolve this, we derive a family of counterfactual objective functions that robustly mitigate harm. We demonstrate our approach with a statistical model for identifying optimal drug doses. Whereas identifying optimal doses using the causal treatment effect results in harmful treatment decisions, our counterfactual algorithm identifies doses that are far less harmful without sacrificing efficacy. Our results show that counterfactual reasoning is a key ingredient for safe and ethical AI.