FairGen: Fair Synthetic Data Generation

With the rising adoption of machine learning across domains such as banking, pharmaceuticals, and ed-tech, it has become critically important to adopt responsible AI methods that ensure models do not unfairly discriminate against any group. Given the lack of clean training data, generative adversarial techniques are often preferred for generating synthetic data, with several state-of-the-art architectures readily available across domains ranging from unstructured data such as text and images to structured datasets such as those used for fraud detection. These techniques address several challenges, including class imbalance, limited training data, and restricted access to data due to privacy concerns. Existing work on generating fair data either targets a specific GAN architecture or is very difficult to tune across GANs. In this paper, we propose a pipeline for generating fairer synthetic data that is independent of the GAN architecture. The proposed pipeline uses a pre-processing algorithm to identify and remove bias-inducing samples. In particular, we claim that most GANs amplify the bias present in the training data when generating synthetic data, and that once these bias-inducing samples are removed, GANs focus more on the genuinely informative samples. Our experimental evaluation on two open-source datasets demonstrates that the proposed pipeline generates fair data, with improved performance in some cases.
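The abstract does not specify how bias-inducing samples are identified, so the following is only a minimal sketch of what such a pre-processing step could look like, not the authors' algorithm. It assumes a binary protected attribute and a binary label, measures bias with the statistical parity difference, and uses a hypothetical greedy filter (`drop_bias_inducing`) that removes positive-label samples from whichever group is currently favoured until the gap falls within a tolerance. All function names, the sample layout `(features, group, label)`, and the tolerance value are illustrative assumptions.

```python
def positive_rate(samples, group):
    """Share of samples in `group` that carry the positive label (1)."""
    labels = [label for _, g, label in samples if g == group]
    return sum(labels) / len(labels) if labels else 0.0


def statistical_parity_difference(samples, unprivileged=0, privileged=1):
    """P(label=1 | unprivileged) - P(label=1 | privileged); 0 means parity."""
    return positive_rate(samples, unprivileged) - positive_rate(samples, privileged)


def drop_bias_inducing(samples, tol=0.05, unprivileged=0, privileged=1):
    """Hypothetical greedy filter (an assumption, not the paper's method).

    Repeatedly removes one positive-label sample from the group that is
    currently favoured until the parity gap is within `tol`.
    """
    kept = list(samples)
    while True:
        spd = statistical_parity_difference(kept, unprivileged, privileged)
        if abs(spd) <= tol:
            break
        # spd < 0: the privileged group is favoured; otherwise the unprivileged one.
        target = (privileged, 1) if spd < 0 else (unprivileged, 1)
        idx = next(
            (i for i, (_, g, y) in enumerate(kept) if (g, y) == target), None
        )
        if idx is None:  # nothing left to remove from the favoured group
            break
        del kept[idx]
    return kept


# Toy data: (features, group, label). Group 1 has an 80% positive rate, group 0 has 40%.
data = (
    [([0.0], 1, 1)] * 8 + [([0.0], 1, 0)] * 2
    + [([0.0], 0, 1)] * 4 + [([0.0], 0, 0)] * 6
)
filtered = drop_bias_inducing(data, tol=0.05)
# `filtered` would then be passed to any GAN's training routine unchanged,
# which is what makes the pipeline independent of the GAN architecture.
```

Because the filter only touches the training set, the downstream generator needs no modification. A filter in practice would likely also weigh how informative each sample is rather than removing the first match, which this sketch does not attempt.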
