Rising concern for the societal implications of artificial intelligence systems has inspired demands for greater transparency and accountability. However, the datasets that power machine learning are often used, shared, and reused with little visibility into the processes of deliberation that led to their creation. Which stakeholder groups had their perspectives included when the dataset was conceived? Which domain experts were consulted regarding how to model subgroups and other phenomena? How were questions of representational bias measured and addressed? Who labeled the data? In this paper, we introduce a rigorous framework for dataset development transparency that supports decision-making and accountability. The framework draws on best practices from the software development lifecycle, motivated by the cyclical, infrastructural, and engineering nature of dataset development. Each stage of the data development lifecycle yields a set of documents that facilitate improved communication and decision-making, and that draw attention to the value and necessity of careful data work. The proposed framework is intended to help close the accountability gap in artificial intelligence systems by making visible the often-overlooked work that goes into dataset creation.