Machine learning models have been criticized for reflecting unfair biases in their training data. Instead of addressing this by introducing fair learning algorithms directly, we focus on generating fair synthetic data, so that any downstream learner is fair. Generating fair synthetic data from unfair data, while remaining truthful to the underlying data-generating process (DGP), is non-trivial. In this paper, we introduce DECAF: a GAN-based fair synthetic data generator for tabular data. With DECAF we embed the DGP explicitly as a structural causal model in the input layers of the generator, allowing each variable to be reconstructed conditioned on its causal parents. This procedure enables inference-time debiasing, in which biased edges can be strategically removed to satisfy user-defined fairness requirements. The DECAF framework is versatile and compatible with several popular definitions of fairness. In our experiments, we show that DECAF successfully removes undesired bias and, in contrast to existing methods, is capable of generating high-quality synthetic data. Furthermore, we provide theoretical guarantees on the generator's convergence and the fairness of downstream models.
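The inference-time debiasing described above amounts to pruning edges from the causal graph before variables are sampled in topological order. A minimal illustrative sketch (the toy DAG, variable names, and choice of removed edge are hypothetical, not the paper's actual model):

```python
# Hypothetical sketch of DECAF-style inference-time debiasing:
# each variable is generated conditioned on its causal parents,
# and "biased" edges are dropped before sampling.

# Toy causal graph mapping each variable to its parents.
dag = {
    "A": [],           # protected attribute
    "X": ["A"],        # feature influenced by A
    "Y": ["A", "X"],   # label influenced by A and X
}

def debias(dag, removed_edges):
    """Return a copy of the DAG with the given (parent, child) edges removed."""
    return {
        child: [p for p in parents if (p, child) not in removed_edges]
        for child, parents in dag.items()
    }

# Removing the direct edge A -> Y targets a fairness notion such as
# demographic parity with respect to A (which edges to cut depends on
# the chosen fairness definition).
fair_dag = debias(dag, {("A", "Y")})
print(fair_dag["Y"])  # A is no longer a parent of Y
```

In the full framework, each variable's generator network would then sample conditioned only on the parents remaining in the pruned graph, so the removed dependency never enters the synthetic data.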