A first line of attack in exploratory data analysis is data visualization, i.e., generating a 2-dimensional representation of data that makes clusters of similar points visually identifiable. Standard Johnson-Lindenstrauss dimensionality reduction does not produce data visualizations. The t-SNE heuristic of van der Maaten and Hinton, which is based on non-convex optimization, has become the de facto standard for visualization in a wide range of applications. This work gives a formal framework for the problem of data visualization: finding a 2-dimensional embedding of clusterable data that correctly separates individual clusters so that they are visually identifiable. We then give a rigorous analysis of the performance of t-SNE under a natural, deterministic condition on the "ground-truth" clusters in the underlying data (similar to conditions assumed in earlier analyses of clustering). These are the first provable guarantees on t-SNE for constructing good data visualizations. We show that our deterministic condition is satisfied by fairly general probabilistic generative models for clusterable data, such as mixtures of well-separated log-concave distributions. Finally, we give theoretical evidence that t-SNE provably succeeds in partially recovering cluster structure even when the above deterministic condition is not met.
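For context, the non-convex optimization behind t-SNE is commonly written as below. This is a sketch of the standard formulation from van der Maaten and Hinton's description of the heuristic, not something stated in the abstract itself. The notation is an assumption made here for illustration: input points $x_1, \dots, x_n$, embedded points $y_1, \dots, y_n \in \mathbb{R}^2$, and per-point bandwidths $\sigma_i$.

```latex
% Input affinities (high-dimensional side), with bandwidth \sigma_i per point
p_{j \mid i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
                    {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2n}

% Output affinities (2-dimensional side), using a heavy-tailed Student-t kernel
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Objective: KL divergence between the two affinity distributions,
% minimized over the embedding y_1, \dots, y_n by gradient descent
\min_{y_1, \dots, y_n} \; \mathrm{KL}(P \,\Vert\, Q)
  = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The objective is non-convex in the $y_i$, which is why guarantees of the kind described in the abstract require an analysis of the gradient-descent dynamics rather than of a global minimizer.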