We study the problem of explainable clustering in the setting first formalized by Moshkovitz, Dasgupta, Rashtchian, and Frost (ICML 2020). A $k$-clustering is said to be explainable if it is given by a decision tree where each internal node splits data points with a threshold cut in a single dimension (feature), and each of the $k$ leaves corresponds to a cluster. We give an algorithm that outputs an explainable clustering that loses at most a factor of $O(\log^2 k)$ compared to an optimal (not necessarily explainable) clustering for the $k$-medians objective, and a factor of $O(k \log^2 k)$ for the $k$-means objective. This improves over the previous best upper bounds of $O(k)$ and $O(k^2)$, respectively, and nearly matches the previous $\Omega(\log k)$ lower bound for $k$-medians and our new $\Omega(k)$ lower bound for $k$-means. The algorithm is remarkably simple. In particular, given an initial, not necessarily explainable, clustering in $\mathbb{R}^d$, it is oblivious to the data points and runs in time $O(dk \log^2 k)$, independent of the number of data points $n$. Our upper and lower bounds also generalize to objectives given by higher $\ell_p$-norms.
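To make the model concrete, the following is a minimal Python sketch of the threshold-tree structure that defines an explainable clustering: each internal node cuts a single coordinate at a threshold, and each of the $k$ leaves is a cluster. The names (`Node`, `assign`) are illustrative and not from the paper; this shows only the output format of an explainable clustering, not the tree-construction algorithm itself.

```python
# Sketch of a threshold tree in the sense of Moshkovitz et al. (ICML 2020).
# Illustrative names; not the authors' construction algorithm.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None      # internal node: coordinate to split on
    threshold: Optional[float] = None  # internal node: cut value in that coordinate
    left: Optional["Node"] = None      # subtree for points with x[feature] <= threshold
    right: Optional["Node"] = None     # subtree for points with x[feature] > threshold
    cluster: Optional[int] = None      # leaf only: the cluster this leaf defines

def assign(root: Node, x) -> int:
    """Route a point to its cluster by following threshold cuts;
    each of the k leaves corresponds to one cluster, so the tree
    itself is the clustering."""
    node = root
    while node.cluster is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.cluster

# Example: a 2-leaf tree (k = 2) splitting coordinate 0 at threshold 0.5.
tree = Node(feature=0, threshold=0.5,
            left=Node(cluster=0), right=Node(cluster=1))
assert assign(tree, [0.2, 3.0]) == 0
assert assign(tree, [0.9, -1.0]) == 1
```

Such a tree is determined entirely by the (feature, threshold) pairs at its $k-1$ internal nodes. This is what makes the obliviousness claim above possible: the tree can be built from the $k$ reference centers alone, without revisiting the data points, which is why the running time $O(dk \log^2 k)$ does not depend on $n$.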