Subset selection algorithms are ubiquitous in AI-driven applications, including online recruiting portals and image search engines, so it is imperative that these tools are not discriminatory on the basis of protected attributes such as gender or race. Currently, fair subset selection algorithms assume that the protected attributes are known as part of the dataset. However, protected attributes may be noisy due to errors during data collection or because they are imputed (as is often the case in real-world settings). While a large body of work addresses the effect of noise on the performance of machine learning algorithms, its effect on fairness remains largely unexamined. We find that, in the presence of noisy protected attributes, attempting to increase fairness without accounting for the noise can, in fact, decrease the fairness of the result! To address this, we consider an existing noise model in which probabilistic information about the protected attributes is available (e.g., [58, 34, 20, 46]), and ask: is fair selection possible under noisy conditions? We formulate a ``denoised'' selection problem that works for a large class of fairness metrics; given a desired fairness goal, the solution to the denoised problem violates the goal by at most a small multiplicative amount with high probability. Although this denoised problem turns out to be NP-hard, we give a linear-programming-based approximation algorithm for it. We evaluate this approach on both synthetic and real-world datasets. Our empirical results show that it can produce subsets that significantly improve the fairness metrics despite the presence of noisy protected attributes and, compared to prior noise-oblivious approaches, achieves better Pareto trade-offs between utility and fairness.
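To give a concrete feel for the kind of optimization the abstract describes, the following is a minimal sketch of a linear-programming relaxation of fair subset selection with probabilistic protected-attribute estimates. The specific formulation, the variable names (utility, p, alpha, k), the lower-bound fairness constraint, and the naive rounding step are illustrative assumptions for this sketch, not the paper's denoised problem or its approximation algorithm.

\begin{verbatim}
# A minimal sketch (illustrative only): an LP relaxation of selecting k items
# to maximize utility, where p[i, g] is the estimated probability that item i
# belongs to protected group g (e.g., output of an imputation model), and the
# fairness goal asks the *expected* number of selected items from each group
# to be at least alpha[g] * k.
import numpy as np
from scipy.optimize import linprog

def denoised_fair_selection(utility, p, alpha, k):
    """Solve the LP relaxation and round to a 0/1 selection of size k."""
    n, m = p.shape
    c = -utility                      # linprog minimizes, so negate utility
    A_eq = np.ones((1, n))            # sum_i x_i = k  (subset size)
    b_eq = np.array([float(k)])
    A_ub = -p.T                       # -sum_i p[i, g] x_i <= -alpha[g] * k
    b_ub = -alpha * k
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * n, method="highs")
    x = res.x
    # Naive rounding heuristic (not the paper's): keep the k largest values.
    chosen = np.argsort(-x)[:k]
    selection = np.zeros(n, dtype=int)
    selection[chosen] = 1
    return selection

# Toy usage: 6 candidates, 2 groups, select 3, with each group expected to
# make up at least 40% of the selection.
utility = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
p = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
              [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]])
alpha = np.array([0.4, 0.4])
print(denoised_fair_selection(utility, p, alpha, k=3))
\end{verbatim}

In this sketch the fairness constraint is expressed in expectation over the noisy attribute estimates; the abstract's guarantee (violating the fairness goal by at most a small multiplicative amount with high probability) would require the paper's own formulation and rounding scheme rather than this heuristic.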