Fair clustering aims to divide data into distinct clusters while preventing sensitive attributes (\textit{e.g.}, gender, race, RNA sequencing technique) from dominating the clustering. Although a number of recent works have achieved considerable success, most of them are heuristic, and a unified theory for algorithm design is lacking. In this work, we fill this gap by developing a mutual information theory for deep fair clustering and accordingly designing a novel algorithm, dubbed FCMI. In brief, by maximizing and minimizing mutual information, FCMI is designed to achieve four characteristics highly desired in deep fair clustering, \textit{i.e.}, compact, balanced, and fair clusters, as well as informative features. Beyond these contributions to theory and algorithm, this work also proposes a novel fair clustering metric, likewise built upon information theory. Unlike existing evaluation metrics, our metric measures clustering quality and fairness as a whole rather than separately. To verify the effectiveness of the proposed FCMI, we conduct experiments on six benchmarks, including a single-cell RNA-seq atlas, and compare against 11 state-of-the-art methods in terms of five metrics. The code is available at \url{https://pengxi.me}.
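For reference, the fairness goal described in the abstract can be related to the standard definition of mutual information. Writing $C$ for the cluster assignment and $G$ for the sensitive attribute (both symbols are notation assumed here for illustration, not necessarily the notation of FCMI),
\[
I(C;G) \;=\; \sum_{c}\sum_{g} p(c,g)\,\log\frac{p(c,g)}{p(c)\,p(g)},
\]
which is non-negative and equals zero exactly when $C$ and $G$ are independent. Under this reading, a clustering that is not dominated by the sensitive attribute corresponds to a small $I(C;G)$. This is a sketch of the general idea only, not the specific objective that the method optimizes.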