Although numerous algorithms have been proposed to solve the categorical data clustering problem, how to assess the statistical significance of a set of categorical clusters remains unaddressed. To fill this gap, we employ the likelihood ratio test to derive a test statistic that can serve as a significance-based objective function in categorical data clustering. Consequently, a new clustering algorithm is proposed in which the significance-based objective function is optimized via a Monte Carlo search procedure. As a by-product, we can further calculate an empirical $p$-value to assess the statistical significance of a set of clusters and develop an improved gap statistic for estimating the number of clusters. Extensive experimental studies suggest that our method achieves performance comparable to that of state-of-the-art categorical data clustering algorithms. Moreover, comprehensive empirical results demonstrate the effectiveness of this significance-based formulation for statistical cluster validation and for estimating the number of clusters.
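To make the idea of a likelihood-ratio statistic with an empirical $p$-value concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a fixed, pre-specified partition, uses the classical G statistic for independence between cluster label and each attribute (summed over attributes), and builds the null distribution by permuting the labels. The function names `g_statistic` and `empirical_p_value` and the parameters `n_perm` and `seed` are hypothetical and chosen for illustration only.

```python
import numpy as np


def g_statistic(X, labels):
    """Sum over attributes of the likelihood-ratio (G) statistic for
    independence between the cluster label and each categorical attribute.
    This is an illustrative stand-in, not the statistic derived in the paper."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    _, lab_idx = np.unique(labels, return_inverse=True)
    k = lab_idx.max() + 1
    total = 0.0
    for j in range(X.shape[1]):
        _, val_idx = np.unique(X[:, j], return_inverse=True)
        m = val_idx.max() + 1
        # Contingency table: clusters (rows) by attribute values (columns).
        observed = np.zeros((k, m))
        np.add.at(observed, (lab_idx, val_idx), 1)
        expected = (
            observed.sum(axis=1, keepdims=True)
            * observed.sum(axis=0, keepdims=True)
            / observed.sum()
        )
        mask = observed > 0  # empty cells contribute 0 to the sum
        total += 2.0 * np.sum(
            observed[mask] * np.log(observed[mask] / expected[mask])
        )
    return total


def empirical_p_value(X, labels, n_perm=999, seed=0):
    """Monte Carlo p-value: fraction of label permutations whose statistic
    is at least as large as the observed one, with the usual +1 correction."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = g_statistic(X, labels)
    null = np.array(
        [g_statistic(X, rng.permutation(labels)) for _ in range(n_perm)]
    )
    p = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p
```

Calling `empirical_p_value(X, labels)` on an array of categorical rows and a label vector returns the observed statistic and its Monte Carlo $p$-value. Note that permuting labels is valid only for a partition chosen independently of the data; when the clusters are obtained by optimizing the statistic on the same data, each null sample would need to be re-clustered, which this sketch does not do.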