With the advent of generative modeling techniques, synthetic data has penetrated various domains, from unstructured data such as images and text to structured datasets modeling healthcare outcomes, risk decisioning in finance, and more. It overcomes challenges such as limited training data, class imbalance, and restricted access to datasets owing to privacy concerns. To ensure that models used for automated decisioning make fair decisions, prior work exists to quantify and mitigate these issues. This study aims to establish the trade-off between bias and fairness in models trained on synthetic data. We studied variants of synthetic data generation techniques, including differentially private generation schemes, to understand bias amplification. Through experiments on a tabular dataset, we demonstrate that models trained on synthetic data exhibit varying levels of bias impact. Techniques that generate less correlated features perform well, as evidenced by fairness metrics: 94\%, 82\%, and 88\% relative drops in DPD (demographic parity difference), EoD (equality of odds), and EoP (equality of opportunity), respectively, and a 24\% relative improvement in DPR (demographic parity ratio) with respect to the real dataset. We believe the outcome of our research study will help data science practitioners understand the bias introduced by the use of synthetic data.
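As a minimal sketch of the four fairness metrics named above, the following computes them from model predictions under their standard definitions, assuming binary labels and a binary sensitive attribute. The helper `fairness_metrics` is hypothetical, not the paper's code, and mirrors the common conventions (e.g. equalized odds as the larger of the TPR and FPR gaps):

```python
import numpy as np

def fairness_metrics(y_true, y_pred, group):
    """Illustrative DPD, DPR, EoD, and EoP for a binary classifier
    over a binary sensitive attribute (group values 0 and 1)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates = {}
    for g in (0, 1):
        mask = group == g
        sel = y_pred[mask].mean()                  # selection rate P(y_hat=1 | group=g)
        tpr = y_pred[mask & (y_true == 1)].mean()  # true positive rate within group
        fpr = y_pred[mask & (y_true == 0)].mean()  # false positive rate within group
        rates[g] = (sel, tpr, fpr)
    (sel0, tpr0, fpr0), (sel1, tpr1, fpr1) = rates[0], rates[1]
    return {
        "DPD": abs(sel0 - sel1),                    # demographic parity difference
        "DPR": min(sel0, sel1) / max(sel0, sel1),   # demographic parity ratio
        "EoP": abs(tpr0 - tpr1),                    # equality of opportunity (TPR gap)
        "EoD": max(abs(tpr0 - tpr1),
                   abs(fpr0 - fpr1)),               # equalized odds (worst of TPR/FPR gaps)
    }
```

Comparing these metrics for a model trained on real data against the same model trained on synthetic data gives the relative drops and improvements reported in the abstract.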