Clustering is a fundamental unsupervised learning problem where a dataset is partitioned into clusters that consist of nearby points in a metric space. A recent variant, fair clustering, associates a color with each point representing its group membership and requires that each color has (approximately) equal representation in each cluster to satisfy group fairness. In this model, the cost of the clustering objective increases due to enforcing fairness in the algorithm. The relative increase in the cost, the ''price of fairness,'' can indeed be unbounded. Therefore, in this paper we propose to treat an upper bound on the clustering objective as a constraint on the clustering problem, and to maximize equality of representation subject to it. We consider two fairness objectives: the group utilitarian objective and the group egalitarian objective, as well as the group leximin objective which generalizes the group egalitarian objective. We derive fundamental lower bounds on the approximation of the utilitarian and egalitarian objectives and introduce algorithms with provable guarantees for them. For the leximin objective we introduce an effective heuristic algorithm. We further derive impossibility results for other natural fairness objectives. We conclude with experimental results on real-world datasets that demonstrate the validity of our algorithms.
翻译:分组是一个根本性的、不受监督的学习问题, 即数据集被分割成由某计量空间的近点组成的集群。 最近的一个变式, 公平分组, 将颜色与代表其集团成员的每个点联系起来, 要求每个颜色在每组中具有( 大约) 平等的代表性, 以满足集团公平性。 在这个模型中, 组合目标的成本会因执行算法的公平性而增加。 成本的相对增加, “ 公平性价格 ” 确实可以不受约束。 因此, 在本文中, 我们提议将组合目标的上限作为限制集中问题的一种约束, 并尽量扩大代表性的平等性。 我们考虑两个公平性目标: 集体使用目标和群体平等性目标, 以及将群体平等性目标概括化的集团法性目标。 我们从实用性和平等性目标的近似值上得出基本较低的界限, 并引入具有可调控保证的算法。 对于词汇目标, 我们引入有效的超值算法性算法性算法性算法性。 我们进一步为其他自然公平性目标得出不可能的结果。 我们用真实性数据算法性。