What does it mean for a clustering to be fair? One popular approach seeks to ensure that each cluster contains groups in (roughly) the same proportion in which they exist in the population. The normative principle at play is balance: any cluster might act as a representative of the data, and thus should reflect its diversity. But clustering also captures a different form of representativeness. A core principle in most clustering problems is that a cluster center should be representative of its cluster, by being "close" to the points associated with it. This lets us replace the points by their cluster centers without significant loss in fidelity, which is indeed a common "use case" for clustering. For such a clustering to be fair, the centers should "represent" different groups equally well. We call such a clustering a group-representative clustering. In this paper, we study the structure and computation of group-representative clusterings. We show that this notion naturally parallels the development of fairness notions in classification, with direct analogs of ideas like demographic parity and equal opportunity. We demonstrate that these notions are distinct from, and cannot be captured by, balance-based notions of fairness. We present approximation algorithms for group-representative $k$-median clustering and couple this with an empirical evaluation on various real-world data sets.
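To make the idea of centers representing groups "equally well" concrete, the following is a minimal sketch, not the paper's formal definition. It assumes that a group's representation is measured by the average Euclidean distance from its points to their nearest center (a $k$-median style cost). The function name `per_group_cost` and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def per_group_cost(points, groups, centers):
    """Average distance from each group's points to their nearest center.

    Assumption: "how well a group is represented" is taken to be this
    average k-median style cost; the paper's definitions may differ.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for point, group in zip(points, groups):
        # Each point is served by its closest center.
        totals[group] += min(math.dist(point, c) for c in centers)
        counts[group] += 1
    return {g: totals[g] / counts[g] for g in totals}


# Toy data (hypothetical): two groups on a line, two centers.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (14.0, 0.0)]
groups = ["A", "A", "B", "B"]
centers = [(0.0, 0.0), (10.0, 0.0)]

costs = per_group_cost(points, groups, centers)
# Gap between the worst and best represented groups.
gap = max(costs.values()) - min(costs.values())
print(costs, gap)
```

In this toy example each cluster contains only one group, so a balance-based criterion would reject the clustering outright, while the per-group costs measure something different: how far each group's points are from the centers that stand in for them. A smaller gap means the groups are represented more evenly under this assumed measure.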