We study fair clustering problems in a setting where distance information is obtained from two sources: a strong oracle providing exact distances, but at a high cost, and a weak oracle providing potentially inaccurate distance estimates at a low cost. The goal is to produce a near-optimal fair clustering on $n$ input points with a minimum number of strong oracle queries. This models the increasingly common trade-off between accurate but expensive similarity measures (e.g., large-scale embeddings) and cheaper but inaccurate alternatives. The study of fair clustering in the model is motivated by the important quest of achieving fairness with the presence of inaccurate information. We achieve the first $(1+\varepsilon)$-coresets for fair $k$-median clustering using $\text{poly}\left(\frac{k}{\varepsilon}\cdot\log n\right)$ queries to the strong oracle. Furthermore, our results imply coresets for the standard setting (without fairness constraints), and we could in fact obtain $(1+\varepsilon)$-coresets for $(k,z)$-clustering for general $z=O(1)$ with a similar number of strong oracle queries. In contrast, previous results achieved a constant-factor $(>10)$ approximation for the standard $k$-clustering problems, and no previous work considered the fair $k$-median clustering problem.
翻译:我们研究一种设置下的公平聚类问题,其中距离信息来自两个来源:一个强预言机提供精确距离但成本高昂,另一个弱预言机提供可能不准确的距离估计但成本低廉。目标是以最少的强预言机查询次数,在$n$个输入点上产生接近最优的公平聚类。这模拟了日益常见的在准确但昂贵的相似性度量(例如大规模嵌入)与廉价但不准确的替代方案之间的权衡。在该模型中研究公平聚类的动机,是在存在不准确信息的情况下实现公平性的重要探索。我们首次实现了公平$k$-中位数聚类的$(1+\varepsilon)$-核心集,仅需使用$\text{poly}\left(\frac{k}{\varepsilon}\cdot\log n\right)$次强预言机查询。此外,我们的结果意味着标准设置(无公平约束)下的核心集,并且事实上我们可以用类似数量的强预言机查询,为一般的$z=O(1)$获得$(k,z)$-聚类的$(1+\varepsilon)$-核心集。相比之下,先前的研究仅针对标准$k$-聚类问题实现了常数因子($>10$)近似,且未有先前工作考虑公平$k$-中位数聚类问题。