Hierarchical Agglomerative Clustering (HAC) algorithms are widely used in modern data science. They partition a dataset into clusters while also producing a hierarchy that relates the data samples. HAC is applied in many areas, such as biology, natural language processing, and recommender systems. It is therefore imperative that these algorithms be fair: even if the dataset contains biases against certain protected groups, the resulting clusters should not discriminate against samples from any of these groups. However, recent work on clustering fairness has mostly focused on center-based clustering algorithms, such as k-median and k-means clustering. In this paper, we propose fair algorithms for HAC that 1) enforce fairness constraints irrespective of the distance linkage criterion used, 2) generalize to any natural measure of clustering fairness for HAC, 3) work for multiple protected groups, and 4) have running times competitive with vanilla HAC. Through extensive experiments on multiple real-world UCI datasets, we show that our proposed algorithm finds fairer clusterings than vanilla HAC as well as other state-of-the-art fair clustering approaches.