Given a set of points in $d$-dimensional space, an explainable clustering is one where the clusters are specified by a tree of axis-aligned threshold cuts. Dasgupta et al. (ICML 2020) posed the question of the price of explainability: the worst-case ratio of the cost of the best explainable clustering to that of the best unconstrained clustering. We show that the price of explainability for $k$-medians is at most $1+H_{k-1}$; in fact, we show that the popular Random Thresholds algorithm has exactly this price of explainability, matching the known lower bound constructions. We complement our tight analysis of this particular algorithm by constructing instances where the price of explainability (using any algorithm) is at least $(1-o(1)) \ln k$, showing that our result is best possible, up to lower-order terms. We also improve the price of explainability for the $k$-means problem to $O(k \ln \ln k)$ from the previous $O(k \ln k)$, considerably closing the gap to the lower bound of $\Omega(k)$. Finally, we study the algorithmic question of finding the best explainable clustering: we show that explainable $k$-medians and $k$-means cannot be approximated to a factor better than $O(\ln k)$, under standard complexity-theoretic conjectures. This essentially settles the approximability of explainable $k$-medians and leaves open the intriguing possibility of obtaining significantly better approximation algorithms for $k$-means than its price of explainability.
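To make the notion of a threshold-cut tree concrete, the following is a minimal sketch of a Random-Thresholds-style tree build: repeatedly sample a random coordinate and a random threshold inside the bounding box of the surviving centers, and keep the cut only if it separates at least two centers. This is an illustrative rendering under stated assumptions (distinct centers, uniform sampling per node), not the paper's exact algorithm or analysis; all function names are invented for illustration.

```python
import random

def random_thresholds_tree(centers):
    """Recursively partition a list of distinct centers with random
    axis-aligned threshold cuts until each leaf holds one center.
    Returns either a single center (leaf) or a tuple
    (axis, threshold, left_subtree, right_subtree)."""
    if len(centers) == 1:
        return centers[0]
    d = len(centers[0])
    while True:
        # Sample a random coordinate and a random threshold inside the
        # bounding box of the surviving centers along that coordinate.
        axis = random.randrange(d)
        lo = min(c[axis] for c in centers)
        hi = max(c[axis] for c in centers)
        theta = random.uniform(lo, hi)
        left = [c for c in centers if c[axis] <= theta]
        right = [c for c in centers if c[axis] > theta]
        if left and right:  # keep the cut only if it separates centers
            break
    return (axis, theta,
            random_thresholds_tree(left),
            random_thresholds_tree(right))

def assign(tree, point):
    """Follow the threshold tree to the leaf (center) explaining `point`."""
    while isinstance(tree, tuple) and len(tree) == 4:
        axis, theta, left, right = tree
        tree = left if point[axis] <= theta else right
    return tree
```

The resulting tree is the "explanation": every point's cluster is determined by a sequence of single-coordinate comparisons, and every center ends up alone in its leaf, so `assign(tree, c) == c` for each center `c`.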