We identify the task of measuring data to quantitatively characterize the composition of machine learning (ML) data and datasets. Similar to an object's height, width, and volume, data measurements quantify different attributes of data along common dimensions that support comparison. Several lines of research have proposed what we refer to as measurements, under differing terminology; we bring some of this work together, particularly in the fields of computer vision and language, and build on it to motivate measuring data as a critical component of responsible AI development. Measuring data aids in systematically building and analyzing ML data towards specific goals, and in gaining better control over what modern ML systems will learn. We conclude with a discussion of the many avenues for future work, the limitations of data measurements, and how to leverage these measurement approaches in research and practice.