UMAP has supplanted t-SNE as state-of-the-art for visualizing high-dimensional datasets in many disciplines, while the reason for its success is not well understood. In this work, we investigate UMAP's sampling based optimization scheme in detail. We derive UMAP's effective loss function in closed form and find that it differs from the published one. As a consequence, we show that UMAP does not aim to reproduce its theoretically motivated high-dimensional UMAP similarities. Instead, it tries to reproduce similarities that only encode the shared k nearest neighbor graph, thereby challenging the previous understanding of UMAP's effectiveness. Instead, we claim that the key to UMAP's success is its implicit balancing of attraction and repulsion resulting from negative sampling. This balancing in turn facilitates optimization via gradient descent. We corroborate our theoretical findings on toy and single cell RNA sequencing data.
翻译:UMAP取代了t-SNE, 以作为在许多学科中直观高维数据集的最新技术, 但其成功的原因并不十分清楚。 在这项工作中, 我们详细调查UMAP基于抽样的优化计划。 我们以封闭的形式得出UMAP的有效损失功能, 发现其与所公布的功能不同。 因此, 我们显示 UMAP无意复制其具有理论动机的高维UMAP相似之处。 相反, 它试图复制相似之处, 只编码了共享的 k 相邻图, 从而挑战了以前对 UMAP 有效性的理解。 相反, 我们声称, UMAP 成功的关键在于它从负抽样中暗含的吸引力和反作用的平衡。 这种平衡反过来又有助于通过梯度下降实现优化。 我们证实了我们关于 Toy 和 单细胞 RNA 测序数据的理论结论 。