We study the problem of explainable k-medians clustering introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (2020). In this problem, the goal is to construct a threshold decision tree that partitions data into k clusters while minimizing the k-medians objective. These trees are interpretable because each internal node makes a simple decision by thresholding a single feature, allowing users to trace and understand how each point is assigned to a cluster. We present the first algorithm for explainable k-medians under the ℓ_p norm for every finite p >= 1. Our algorithm achieves an O(p (log k)^{1 + 1/p - 1/p^2}) approximation to the optimal k-medians cost. Previously, algorithms were known only for p = 1 and p = 2. For p = 2, our algorithm improves upon the existing bound of O(log^{3/2} k), and for p = 1, it matches the tight bound of log k + O(1) up to a multiplicative O(log log k) factor. We also show how to implement our algorithm in a dynamic setting. The dynamic algorithm maintains an explainable clustering under a sequence of insertions and deletions, with amortized update time O(d log^3 k) and O(log k) recourse, making it suitable for large-scale and evolving datasets.
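To make the objective concrete, the following minimal Python sketch shows how a threshold decision tree assigns points to clusters and how the k-medians cost under the ℓ_p norm can be evaluated for a given tree. It is an illustration only, not the algorithm of this paper: the names `Leaf`, `Node`, `assign`, `lp_distance`, and `tree_cost`, the convention that a point goes left when its coordinate is at most the threshold, and the toy tree and data are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    # Hypothetical leaf: holds the center of the cluster it represents.
    center: Sequence[float]


@dataclass
class Node:
    # Hypothetical internal node: thresholds a single feature.
    feature: int       # index of the coordinate that is tested
    threshold: float   # assumed convention: go left if x[feature] <= threshold
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def assign(tree: Union[Node, Leaf], x: Sequence[float]) -> Leaf:
    """Follow the threshold tests from the root down to a leaf (cluster)."""
    while isinstance(tree, Node):
        tree = tree.left if x[tree.feature] <= tree.threshold else tree.right
    return tree


def lp_distance(x: Sequence[float], c: Sequence[float], p: float) -> float:
    """l_p distance between x and c for a finite p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, c)) ** (1.0 / p)


def tree_cost(tree: Union[Node, Leaf], points, p: float) -> float:
    """k-medians cost: sum of l_p distances from each point to its leaf's center."""
    return sum(lp_distance(x, assign(tree, x).center, p) for x in points)


# Toy tree with k = 3 leaves over two features.
tree = Node(
    feature=0, threshold=2.0,
    left=Leaf(center=(0.0, 0.0)),
    right=Node(
        feature=1, threshold=2.0,
        left=Leaf(center=(5.0, 0.0)),
        right=Leaf(center=(5.0, 5.0)),
    ),
)
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 5.0)]

cost_l1 = tree_cost(tree, points, p=1)
cost_l2 = tree_cost(tree, points, p=2)
```

Each point's assignment can be explained by the sequence of single-feature comparisons on its root-to-leaf path, which is what makes such a clustering interpretable. An approximation guarantee such as the one stated above compares `tree_cost` for the constructed tree against the cost of an optimal, unconstrained k-medians clustering.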