Clustering is a fundamental building block of modern statistical analysis pipelines. Fair clustering has seen much attention from the machine learning community in recent years. We are some of the first to study fairness in the context of hierarchical clustering, after the results of Ahmadian et al. from NeurIPS in 2020. We evaluate our results using Dasgupta's cost function, perhaps one of the most prevalent theoretical metrics for hierarchical clustering evaluation. Our work vastly improves the previous $O(n^{5/6}poly\log(n))$ fair approximation for cost to a near polylogarithmic $O(n^\delta poly\log(n))$ fair approximation for any constant $\delta\in(0,1)$. This result establishes a cost-fairness tradeoff and extends to broader fairness constraints than the previous work. We also show how to alter existing hierarchical clusterings to guarantee fairness and cluster balance across any level in the hierarchy.
翻译:分组是现代统计分析管道的基本构件。 近几年来,机器学习界对公平分组问题给予了大量关注。 在2020年NeurIPS的Ahmadian等人(Ahmadian等人)的研究结果之后,我们率先研究等级分组的公平性。 我们使用Dasgupta的成本功能来评估我们的成果,这也许是对等级分组评价最普遍的理论衡量标准之一。 我们的工作极大地改进了以前的$O(n ⁇ 5/6}poly\log(n))美元,以公平近似于美元,以至于接近一个常数的$O(n ⁇ delta polilog(n)$(0,1美元)的公平近似值。这一结果确立了成本公平性权衡,并扩大到比以往工作更广泛的公平性制约。 我们还展示了如何改变现有的等级分组,以确保等级各级的公平和分组平衡。