Classical inference methods notoriously fail when applied to data-driven test hypotheses or inference targets. Instead, dedicated methodologies are required to obtain statistical guarantees for these selective inference problems. Selective inference is particularly relevant post-clustering, typically when testing a difference in mean between two clusters. In this paper, we address convex clustering with $\ell_1$ penalization by leveraging related selective inference tools for regression, based on Gaussian vectors conditioned on polyhedral sets. In the one-dimensional case, we prove a polyhedral characterization of the event of obtaining given clusters, which enables us to propose a test procedure with statistical guarantees. This characterization also allows us to provide a computationally efficient regularization path algorithm. We then extend this test procedure and its guarantees to multi-dimensional clustering with $\ell_1$ penalization, and to more general multi-dimensional clusterings that aggregate one-dimensional ones. Through various numerical experiments, we validate our statistical guarantees and demonstrate the power of our methods to detect differences in mean between clusters. Our methods are implemented in the R package poclin.
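For orientation, the setting summarized above can be sketched in formulas. The notation below ($X$, $\mu$, $\Sigma$, $\lambda$, $\hat{B}$, $\eta$, $A$, $b$, $Z$) is assumed here for illustration and is not taken from the abstract; the objective is the usual form of one-dimensional convex clustering with an $\ell_1$ fusion penalty, and the conditional law is the standard polyhedral result for Gaussian vectors used in selective inference for regression. The paper's exact statements may differ.

$$
\hat{B}(X) \in \operatorname*{arg\,min}_{B \in \mathbb{R}^n} \; \frac{1}{2} \sum_{i=1}^{n} (B_i - X_i)^2 + \lambda \sum_{1 \le i < j \le n} |B_i - B_j|,
\qquad X \sim \mathcal{N}(\mu, \Sigma), \quad \lambda > 0 .
$$

Observations $i$ and $j$ are placed in the same cluster when $\hat{B}_i(X) = \hat{B}_j(X)$. If the event of obtaining a given clustering can be written as a polyhedron $\{A X \le b\}$, then for a contrast vector $\eta$ (for example, the difference of the indicator averages of two clusters) and $Z = X - \Sigma \eta \, (\eta^\top \Sigma \eta)^{-1} \eta^\top X$,

$$
\eta^\top X \;\Big|\; \{A X \le b,\; Z = z\} \;\sim\; \mathcal{TN}\!\left(\eta^\top \mu,\; \eta^\top \Sigma \eta,\; [\mathcal{V}^-(z), \mathcal{V}^+(z)]\right),
$$

a Gaussian truncated to an interval whose endpoints depend only on $A$, $b$, $\eta$, $\Sigma$ and $z$. Under the null hypothesis $\eta^\top \mu = 0$, evaluating the truncated Gaussian distribution function at the observed $\eta^\top X$ then yields a p-value that is valid conditionally on the selected clustering, assuming $\Sigma$ is known.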