Data representativity is crucial when drawing inferences from data through machine learning models. Scholars have increasingly focused on bias and fairness in models, including biases inherent in the input data. However, limited work exists on the representativity of samples (datasets) for appropriate inference in AI systems. This paper reviews definitions and notions of a representative sample and surveys their use in the scientific AI literature. We introduce three measurable concepts to help sharpen these notions and evaluate different data samples. Furthermore, we demonstrate that the contrast between a representative sample in the sense of coverage of the input space and a representative sample that mimics the distribution of the target population is of particular relevance when building AI systems. Through empirical demonstrations on US Census data, we evaluate the opposing inherent qualities of these concepts. Finally, we propose a framework of questions for creating and documenting data with data representativity in mind, as an addition to existing dataset documentation templates.
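The contrast between a sample that covers the input space and a sample that mimics the target distribution can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the three measurable concepts introduced in the paper: the group labels, the population proportions, the `coverage` and `total_variation` helpers, and the `min_count` threshold are all hypothetical and chosen only to make the trade-off visible.

```python
from collections import Counter


def coverage(sample, categories, min_count=1):
    """Fraction of population categories that appear at least
    `min_count` times in the sample (hypothetical coverage score)."""
    counts = Counter(sample)
    covered = sum(1 for c in categories if counts[c] >= min_count)
    return covered / len(categories)


def total_variation(sample, population_probs):
    """Total variation distance between the sample's empirical
    proportions and the population proportions.
    Assumes every sample value is a key of `population_probs`."""
    counts = Counter(sample)
    n = len(sample)
    return 0.5 * sum(abs(counts[c] / n - p) for c, p in population_probs.items())


# Hypothetical target population with one large and one small group.
population_probs = {"A": 0.70, "B": 0.25, "C": 0.05}

# Sample 1 mimics the population proportions.
proportional = ["A"] * 70 + ["B"] * 25 + ["C"] * 5
# Sample 2 spreads rows roughly evenly across groups.
balanced = ["A"] * 34 + ["B"] * 33 + ["C"] * 33

for name, sample in [("proportional", proportional), ("balanced", balanced)]:
    print(
        name,
        "coverage(min_count=30) =", round(coverage(sample, population_probs, 30), 2),
        "TV distance =", round(total_variation(sample, population_probs), 2),
    )
```

Under these assumptions, the proportional sample matches the population distribution closely but leaves the smaller groups below the 30-row threshold, while the balanced sample covers every group at the cost of a larger distance from the population distribution. Which property matters more depends on the inference the AI system is meant to support.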