Recent advances in Natural Language Processing (NLP), and specifically automated Question Answering (QA) systems, have demonstrated both impressive linguistic fluency and a pernicious tendency to reflect social biases. In this study, we introduce Q-Pain, a dataset for assessing bias in medical QA in the context of pain management, one of the most challenging forms of clinical decision-making. Along with the dataset, we propose a new, rigorous framework, including a sample experimental design, for measuring the potential biases present when making treatment decisions. We demonstrate its use by assessing two reference Question-Answering systems, GPT-2 and GPT-3, and find statistically significant differences in treatment between intersectional race-gender subgroups. These findings reaffirm the risks posed by AI in medical settings and the need for datasets like ours to ensure safety before medical AI applications are deployed.