Cluster visualization is an essential task for nonlinear dimensionality reduction as a data analysis tool. It is often believed that Student t-Distributed Stochastic Neighbor Embedding (t-SNE) can reveal clusters in well-clusterable data, with a smaller Kullback-Leibler (KL) divergence corresponding to a better-quality embedding. There is even a theoretical proof guaranteeing this property. However, we point out that this is not necessarily the case: t-SNE may leave clustering patterns hidden despite strong signals present in the data. We provide extensive empirical evidence to support this claim. First, we present several real-world counter-examples in which t-SNE fails even though the input neighborhoods are well clusterable. Tuning t-SNE hyperparameters or using better optimization algorithms does not resolve this issue, because a better t-SNE learning objective can correspond to a worse cluster embedding. Second, we examine the assumptions behind the clustering guarantee of t-SNE and find that they are often violated by real-world data sets.
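To make the objective under discussion concrete, the following is a minimal NumPy sketch (not the paper's code) of the KL divergence that t-SNE minimizes, computed between a fixed input affinity matrix `P` and the Student-t affinities `Q` induced by a 2-D embedding `Y`. The toy arrays below are illustrative assumptions, not data from the paper.

```python
import numpy as np

def tsne_kl(P, Y):
    """KL(P || Q) for an embedding Y, with Q built from the
    Student-t (Cauchy) kernel as in the t-SNE objective."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + d2)          # heavy-tailed pairwise similarities
    np.fill_diagonal(W, 0.0)      # no self-similarity
    Q = W / W.sum()               # normalize to a joint distribution
    mask = P > 0                  # sum only over pairs P supports
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

# Toy affinities: two clusters {0, 1} and {2, 3}, with all input
# probability mass placed on the within-cluster pairs.
P = np.zeros((4, 4))
P[0, 1] = P[1, 0] = P[2, 3] = P[3, 2] = 0.25

# An embedding that separates the clusters vs. one that mixes them.
Y_good = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0]])
Y_bad  = np.array([[0.0, 0.0], [10.0, 0.0], [0.1, 0.0], [10.1, 0.0]])

kl_good, kl_bad = tsne_kl(P, Y_good), tsne_kl(P, Y_bad)
```

Here the cluster-respecting embedding attains the lower KL, which is the common belief the abstract questions: the paper's point is that a lower KL value does not always coincide with a better cluster layout.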