In recent years, machine learning (ML) researchers have wrestled with defining and improving ML benchmarks and datasets. In parallel, some have trained a critical lens on the ethics of dataset creation and ML research. In this position paper, we highlight the entanglement of ethics with seemingly ``technical'' or ``scientific'' decisions about the design of ML benchmarks. Our starting point is the existence of multiple overlooked structural similarities between human intelligence benchmarks and ML benchmarks. Both types of benchmarks set standards for describing, evaluating, and comparing performance on tasks relevant to intelligence -- standards that many scholars of human intelligence have long recognized as value-laden. We draw on perspectives from feminist philosophy of science on IQ benchmarks and on thick concepts in social science to argue that values need to be considered and documented when creating ML benchmarks. It is neither possible nor desirable to sidestep these value choices by creating value-neutral benchmarks. Finally, we outline practical recommendations for ML benchmark research ethics and ethics review.