We consider the problem of explainable $k$-medians and $k$-means introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian~(ICML 2020). In this problem, the goal is to find a threshold decision tree that partitions the data into $k$ clusters and minimizes the $k$-medians or $k$-means objective. The resulting clustering is easy to interpret because every decision node of a threshold tree splits the data into two groups based on a single feature. We propose a new algorithm for this problem that is $\tilde O(\log k)$-competitive with $k$-medians under the $\ell_1$ norm and $\tilde O(k)$-competitive with $k$-means. This improves on the previous guarantees of $O(k)$ and $O(k^2)$, respectively, by Dasgupta et al. (2020). We also provide a new algorithm that is $O(\log^{3/2} k)$-competitive for $k$-medians under the $\ell_2$ norm. Our first algorithm is near-optimal: Dasgupta et al. (2020) showed a lower bound of $\Omega(\log k)$ for $k$-medians, and in this work we prove a lower bound of $\tilde\Omega(k)$ for $k$-means. We also provide a lower bound of $\Omega(\log k)$ for $k$-medians under the $\ell_2$ norm.
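To make the problem setup concrete, the following is a minimal sketch of how a threshold tree induces a clustering and how the two objectives can be evaluated on it. It is not the algorithm proposed in this work: the tree is fixed by hand, and the nested-tuple tree representation, the function names (`assign`, `kmedians_l1_cost`, `kmeans_cost`), and the toy data are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical tree representation (an assumption for this sketch):
#   leaf:     ("leaf", cluster_id)
#   internal: ("node", feature_index, threshold, left_subtree, right_subtree)

def assign(tree, x):
    """Route point x to a leaf; go left when x[feature] <= threshold."""
    while tree[0] == "node":
        _, j, theta, left, right = tree
        tree = left if x[j] <= theta else right
    return tree[1]

def kmedians_l1_cost(X, labels, k):
    """Sum of l1 distances from each point to its cluster's coordinate-wise median."""
    cost = 0.0
    for c in range(k):
        pts = X[labels == c]
        if len(pts):
            cost += np.abs(pts - np.median(pts, axis=0)).sum()
    return cost

def kmeans_cost(X, labels, k):
    """Sum of squared l2 distances from each point to its cluster's mean."""
    cost = 0.0
    for c in range(k):
        pts = X[labels == c]
        if len(pts):
            cost += ((pts - pts.mean(axis=0)) ** 2).sum()
    return cost

# Toy data: five points in the plane, to be split into k = 3 clusters.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [10.0, 5.0], [11.0, 5.0]])

# Each internal node thresholds a single feature, so the tree has k leaves.
tree = ("node", 0, 5.0,
        ("leaf", 0),
        ("node", 1, 2.5, ("leaf", 1), ("leaf", 2)))

labels = np.array([assign(tree, x) for x in X])
print(labels.tolist())
print(kmedians_l1_cost(X, labels, 3), kmeans_cost(X, labels, 3))
```

An algorithm is $c$-competitive in this setting when the cost of the clustering induced by its tree is at most $c$ times the cost of an optimal unconstrained clustering with the same $k$.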