Neighbor embedding methods $t$-SNE and UMAP are the de facto standard for visualizing high-dimensional datasets. They appear to use very different loss functions with different motivations, and the exact relationship between them has been unclear. Here we show that UMAP is effectively negative sampling applied to the $t$-SNE loss function. We explain the difference between negative sampling and noise-contrastive estimation (NCE), which has been used to optimize $t$-SNE under the name NCVis. We prove that, unlike NCE, negative sampling learns a scaled data distribution. When applied in the neighbor embedding setting, it yields more compact embeddings with increased attraction, explaining the differences in appearance between UMAP and $t$-SNE. Further, we generalize the notion of negative sampling and obtain a spectrum of embeddings, encompassing visualizations similar to $t$-SNE, NCVis, and UMAP. Finally, we explore the connection between representation learning in the SimCLR setting and neighbor embeddings, and show that (i) $t$-SNE can be optimized using the InfoNCE loss and in a parametric setting; (ii) various contrastive losses with only a few noise samples can yield competitive performance in the SimCLR setup.
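To make the contrast between the two losses concrete, here is a minimal PyTorch sketch (not the paper's implementation) of a UMAP-style negative sampling loss and an InfoNCE loss for a single positive pair, both built on the Cauchy similarity $\phi(y_i, y_j) = 1/(1 + \lVert y_i - y_j\rVert^2)$ that $t$-SNE and UMAP share in the embedding space; the function names, the choice of $m=5$ negatives, and the toy data are illustrative assumptions.

```python
import torch

def cauchy_sim(a, b):
    # Cauchy kernel used in the low-dimensional embedding space:
    # phi(y_i, y_j) = 1 / (1 + ||y_i - y_j||^2)
    return 1.0 / (1.0 + ((a - b) ** 2).sum(-1))

def neg_sampling_loss(y_i, y_j, y_neg):
    # UMAP-style negative sampling: attract the positive pair (i, j),
    # repel each of the m sampled negatives independently.
    pos = cauchy_sim(y_i, y_j)
    neg = cauchy_sim(y_i.unsqueeze(0), y_neg)  # shape (m,)
    return -torch.log(pos) - torch.log(1.0 - neg).sum()

def infonce_loss(y_i, y_j, y_neg):
    # InfoNCE: classify the true neighbor j among the m negatives;
    # the normalization over negatives acts like a partition function.
    pos = cauchy_sim(y_i, y_j)
    neg = cauchy_sim(y_i.unsqueeze(0), y_neg)  # shape (m,)
    return -torch.log(pos / (pos + neg.sum()))

# Toy usage: one positive pair and m = 5 negatives in a 2-D embedding.
torch.manual_seed(0)
y_i, y_j = torch.randn(2), torch.randn(2)
y_neg = torch.randn(5, 2)
print("negative sampling:", neg_sampling_loss(y_i, y_j, y_neg).item())
print("InfoNCE:", infonce_loss(y_i, y_j, y_neg).item())
```

In this sketch, InfoNCE normalizes the positive similarity by the similarities to the sampled negatives, loosely playing the role of $t$-SNE's partition function, whereas negative sampling repels each negative in isolation; the resulting increase in effective attraction is what the abstract identifies as the source of UMAP's more compact embeddings.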