We study fair clustering problems as proposed by Chierichetti et al. (NIPS 2017). Here, points have a sensitive attribute and all clusters in the solution are required to be balanced with respect to it (to counteract any form of data-inherent bias). Previous algorithms for fair clustering do not scale well. We show how to model and compute so-called coresets for fair clustering problems, which can be used to significantly reduce the input data size. We prove that the coresets are composable and show how to compute them in a streaming setting. Furthermore, we propose a variant of Lloyd's algorithm that computes fair clusterings and extend it to a fair k-means++ clustering algorithm. We implement these algorithms and provide empirical evidence that the combination of our approximation algorithms and the coreset construction yields a scalable algorithm for fair k-means clustering.
翻译:我们研究Chierichetti等人(2017年,NIPS)提出的公平分组问题。在这里,各点具有敏感属性,解决方案中的所有组群都需要平衡(以抵消任何形式的数据内在偏差)。以前的公平分组算法规模不高。我们展示了如何为公平分组问题建模和计算所谓的核心群集,这些核心群集可用于显著缩小输入数据大小。我们证明核心群集是可比较的,并显示如何在串流环境中进行计算。此外,我们提出了劳埃德的算法变量,计算公平分组,并将其扩展至公平的K- means+组群算法。我们实施了这些算法,并提供了经验证据,证明我们近似算法和核心群集构造的组合为公平的K-point群集提供了可缩缩放的算法。