Data-driven algorithms are only as good as the data they work with, yet data sets, especially social data, often fail to represent minorities adequately. Representation bias in data can arise for various reasons, ranging from historical discrimination to selection and sampling biases in data acquisition and preparation. Given that "bias in, bias out", one cannot expect AI-based solutions to have equitable outcomes for societal applications without addressing issues such as representation bias. While there has been extensive study of fairness in machine learning models, including several review papers, bias in the data itself has been studied far less. This paper reviews the literature on identifying and resolving representation bias as a feature of a data set, independent of how the data are consumed later. The scope of this survey is limited to structured (tabular) and unstructured (e.g., image, text, graph) data. It presents taxonomies that categorize the studied techniques along multiple design dimensions and provides a side-by-side comparison of their properties. There is still a long way to go before representation bias in data is fully addressed. The authors hope that this survey motivates researchers to approach these challenges by building on existing work within their respective domains.
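To make the notion of representation bias as a property of the data itself concrete, the following minimal sketch (an illustration, not a method from any surveyed work) compares each subgroup's share of a tabular data set against a reference population share; the `representation_rates` helper, the toy `gender` column, and the 50% reference share are all assumptions for the example.

```python
import pandas as pd

def representation_rates(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Report each subgroup's count and share of the data set.

    Representation bias, viewed as a feature of the data rather than of a
    downstream model, can be surfaced by comparing how often each subgroup
    appears in the sample versus its share of the target population.
    """
    counts = df[group_col].value_counts()
    shares = counts / counts.sum()
    return pd.DataFrame({"count": counts, "share": shares})

# Hypothetical tabular data set with a demographic attribute.
data = pd.DataFrame({"gender": ["F", "M", "M", "M", "M", "F", "M", "M"]})
print(representation_rates(data, "gender"))
# If the reference population is roughly balanced (~50% per group), the
# "F" subgroup here (share 0.25) is under-represented, which flags the
# data set for re-sampling, re-weighting, or further acquisition.
```

Note that this check depends only on the data set and a reference distribution, not on any model trained later, which is precisely the sense in which the survey treats representation bias as a data property.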