Algorithms learn rules and associations from the training data they are exposed to. Yet the very same data that teaches machines to understand and predict the world contains societal and historic biases, resulting in biased algorithms that risk further amplifying these biases once put into use for decision support. Synthetic data, on the other hand, promises an unlimited amount of representative, realistic training samples that can be shared further without disclosing the privacy of individual subjects. We present a framework for incorporating fairness constraints into the self-supervised learning process, which then allows an unlimited amount of representative as well as fair synthetic data to be simulated. This framework provides a handle to govern and control for privacy as well as bias within AI at its very source: the training data. We demonstrate the proposed approach by amending an existing generative model architecture and generating a representative as well as fair version of the UCI Adult census data set. While the relationships between attributes are faithfully retained, the gender and racial biases inherent in the original data are controlled for. This is further validated by comparing propensity scores of downstream predictive models trained on the original data versus the fair synthetic data. We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
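To illustrate the kind of fairness constraint the abstract refers to, the following is a minimal sketch, assuming a PyTorch-based generator whose outputs include per-record probabilities for a sensitive attribute (e.g. sex) and for the target (e.g. income > 50K). The demographic-parity penalty and the `fairness_weight` hyperparameter shown here are illustrative assumptions, not the authors' exact formulation.

```python
# Sketch of a differentiable fairness penalty added to a generative model's
# training loss (assumed formulation, not the paper's implementation).
import torch

def demographic_parity_penalty(target_prob: torch.Tensor,
                               group_prob: torch.Tensor) -> torch.Tensor:
    """Soft |P(target=1 | group=1) - P(target=1 | group=0)| over a generated batch.

    target_prob: (N,) probabilities that a generated record has target = 1.
    group_prob:  (N,) probabilities that a generated record belongs to group 1.
    Both are assumed to be differentiable outputs of the generator.
    """
    eps = 1e-8
    rate_g1 = (target_prob * group_prob).sum() / (group_prob.sum() + eps)
    rate_g0 = (target_prob * (1 - group_prob)).sum() / ((1 - group_prob).sum() + eps)
    return (rate_g1 - rate_g0).abs()

# Hypothetical use inside a training step:
#   base_loss = reconstruction_or_adversarial_loss(...)
#   loss = base_loss + fairness_weight * demographic_parity_penalty(y_prob, s_prob)
#   loss.backward()
```

Under this kind of penalty, the generator is nudged toward synthetic populations in which the target rate is equal across sensitive groups, while the base loss continues to enforce fidelity to the remaining attribute relationships.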