Biased Programmers? Or Biased Data? A Field Experiment in Operationalizing AI Ethics

Why do biased predictions arise? What interventions can prevent them? We evaluate 8.2 million algorithmic predictions of math performance from $\approx$400 AI engineers, each of whom developed an algorithm under a randomly assigned experimental condition. Our treatment arms modified programmers' incentives, training data, awareness, and/or technical knowledge of AI ethics. We then assess out-of-sample predictions from their algorithms using randomized audit manipulations of algorithm inputs and ground-truth math performance for 20K subjects. We find that biased predictions are mostly caused by biased training data. However, one-third of the benefit of better training data comes through a novel economic mechanism: Engineers exert greater effort and are more responsive to incentives when given better training data. We also assess how performance varies with programmers' demographic characteristics, and their performance on a psychological test of implicit bias (IAT) concerning gender and careers. We find no evidence that female, minority and low-IAT engineers exhibit lower bias or discrimination in their code. However, we do find that prediction errors are correlated within demographic groups, which creates performance improvements through cross-demographic averaging. Finally, we quantify the benefits and tradeoffs of practical managerial or policy interventions such as technical advice, simple reminders, and improved incentives for decreasing algorithmic bias.
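The cross-demographic averaging result can be illustrated with a standard variance identity. This is a minimal sketch under assumptions that are not taken from the paper: every engineer's prediction error has the same variance `sigma2`, and any two errors being averaged share a pairwise correlation `rho`. The function name and the example correlations (0.8 within a group, 0.3 across groups) are hypothetical and chosen only for illustration.

```python
def averaged_error_variance(sigma2: float, rho: float, n: int = 2) -> float:
    """Variance of the mean of n prediction errors.

    Assumes (for illustration only) that each error has variance sigma2
    and every pair of errors has correlation rho:
        Var(mean) = sigma2 * (1 + (n - 1) * rho) / n
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return sigma2 * (1 + (n - 1) * rho) / n


# Hypothetical correlations, not estimates from the study.
within_group = averaged_error_variance(sigma2=1.0, rho=0.8)  # same-demographic pair
cross_group = averaged_error_variance(sigma2=1.0, rho=0.3)   # cross-demographic pair

# Lower correlation between the averaged errors leaves less variance.
print(f"within-group pair: {within_group:.2f}")
print(f"cross-group pair:  {cross_group:.2f}")
```

Under these assumptions, averaging two predictions whose errors are less correlated removes more noise, which is one way to read the abstract's claim that errors correlated within demographic groups make cross-demographic averaging beneficial.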
