Over the last three decades, researchers have intensively explored clustering tools for categorical data analysis. Although many clustering algorithms have been proposed, the classical k-modes algorithm remains a popular choice for unsupervised learning on categorical data. Surprisingly, our first insight is that, in a natural generative block model, the k-modes algorithm performs poorly for a large range of parameters. We remedy this issue by proposing a soft-rounding variant of the k-modes algorithm (SoftModes) and prove theoretically that this variant addresses the drawbacks of k-modes in the generative model. Finally, we empirically verify that SoftModes performs well on both synthetic and real-world datasets.
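For reference, the classical k-modes iteration mentioned above can be sketched as follows. This is a minimal illustration of the standard hard-assignment procedure (assign each record to the nearest centre under Hamming distance, then set each centre to the attribute-wise mode of its cluster). It is not the SoftModes variant, whose details are not given in this abstract. The function names, the random initialisation, and the iteration cap are assumptions made for the example.

```python
from collections import Counter
import random


def hamming(a, b):
    """Number of attributes on which two categorical records disagree."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(points, k, n_iter=20, seed=0):
    """Classical (hard) k-modes sketch; names and defaults are illustrative."""
    rng = random.Random(seed)
    # Assumed initialisation: k records drawn at random without replacement.
    centres = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centre in Hamming distance (ties go to the lowest index).
        new_labels = [
            min(range(k), key=lambda j: hamming(p, centres[j])) for p in points
        ]
        # Update step: each centre becomes the per-attribute mode of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # an empty cluster keeps its previous centre
                centres[j] = tuple(
                    Counter(column).most_common(1)[0][0] for column in zip(*members)
                )
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centres
```

Calling `k_modes(records, k)` on a list of equal-length tuples returns one cluster label per record together with the `k` mode vectors. The result depends on the random initialisation, so in practice one would typically compare several seeds.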