Current AI regulations require discarding sensitive features (e.g., gender, race, religion) from the algorithm's decision-making process to prevent unfair outcomes. However, even without sensitive features in the training set, algorithms can still discriminate. Indeed, when sensitive features are omitted (fairness under unawareness), they can be inferred through non-linear relations with so-called proxy features. In this work, we propose a way to reveal the potential hidden bias of a machine learning model that can persist even when sensitive features are discarded. This study shows that it is possible to unveil whether a black-box predictor is still biased by exploiting counterfactual reasoning. In detail, when the predictor provides a negative classification outcome, our approach first builds counterfactual examples for a discriminated user category to obtain a positive outcome. Then, the same counterfactual samples feed an external classifier (that targets a sensitive feature), which reveals whether the modifications to the user's characteristics needed for a positive outcome moved the individual to the non-discriminated group. When this occurs, it can be a warning sign of discriminatory behavior in the decision process. Furthermore, we leverage the deviation of the counterfactuals from the original sample to determine which features are proxies for specific sensitive information. Our experiments show that, even if a model is trained without sensitive features, it often suffers from discriminatory biases.
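The probe described above can be illustrated with a minimal sketch. The code below is not the authors' implementation: it uses synthetic data, scikit-learn models for both the black-box predictor and the external sensitive-attribute classifier, and a naive greedy perturbation search as a stand-in for whichever counterfactual generator the paper actually employs.

```python
# Minimal sketch of the counterfactual bias probe (toy data, assumed components).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: X holds the non-sensitive features, s is the withheld sensitive one.
n = 2000
s = rng.integers(0, 2, n)                     # sensitive attribute (group 0 vs group 1)
proxy = s + rng.normal(0, 0.5, n)             # proxy feature correlated with s
other = rng.normal(0, 1, n)
X = np.column_stack([proxy, other])
y = ((proxy + 0.3 * other + rng.normal(0, 0.3, n)) > 0.5).astype(int)  # biased labels

# Black-box predictor trained WITHOUT the sensitive feature (fairness under unawareness).
predictor = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# External classifier that targets the sensitive feature from the remaining features.
sensitive_clf = LogisticRegression().fit(X, s)

def counterfactual(x, model, step=0.1, max_iter=200):
    """Greedily perturb one feature at a time until the predictor flips to the
    positive class (a toy stand-in for a proper counterfactual generator)."""
    cf = x.copy()
    for _ in range(max_iter):
        if model.predict(cf[None, :])[0] == 1:
            return cf
        gains = []
        for j in range(len(cf)):
            trial = cf.copy()
            trial[j] += step
            gains.append(model.predict_proba(trial[None, :])[0, 1])
        cf[int(np.argmax(gains))] += step
    return cf

# Probe: take negatively classified members of group 0, build counterfactuals, and
# check whether the external classifier now assigns them to the other group.
negatives = np.where((predictor.predict(X) == 0) & (s == 0))[0][:50]
crossed, deviations = 0, []
for i in negatives:
    cf = counterfactual(X[i], predictor)
    deviations.append(np.abs(cf - X[i]))
    if sensitive_clf.predict(cf[None, :])[0] == 1:
        crossed += 1

print(f"{crossed}/{len(negatives)} counterfactuals crossed to the non-discriminated group")
print("mean per-feature deviation (larger => stronger proxy candidate):",
      np.round(np.mean(deviations, axis=0), 3))
```

In this sketch, a high crossing rate plays the role of the warning sign mentioned above, and the per-feature deviation of the counterfactuals from the original samples indicates which features act as proxies for the sensitive attribute.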