Rising concern for the societal implications of artificial intelligence systems has inspired demands for greater transparency and accountability. However, the datasets that power machine learning are often used, shared, and reused with little visibility into the processes of deliberation that led to their creation. Which stakeholder groups had their perspectives included when the dataset was conceived? Which domain experts were consulted regarding how to model subgroups and other phenomena? How were questions of representational bias measured and addressed? Who labeled the data? In this paper, we introduce a rigorous framework for dataset development transparency that supports decision-making and accountability. The framework draws on best practices from the software development lifecycle, motivated by the cyclical, infrastructural, and engineering nature of dataset development. Each stage of the data development lifecycle yields a set of documents that facilitate improved communication and decision-making, and that draw attention to the value and necessity of careful data work. The proposed framework is intended to help close the accountability gap in artificial intelligence systems by making visible the often-overlooked work that goes into dataset creation.