Revised Conditional t-SNE: Looking Beyond the Nearest Neighbors

Source: arXiv. 13 pages plus an 8-page supplement. To be published in the Proceedings of the 21st International Symposium on Intelligent Data Analysis (IDA 2023), Springer, 2023.

Conditional t-SNE (ct-SNE) is a recent extension to t-SNE that allows removal of known cluster information from the embedding, to obtain a visualization revealing structure beyond label information. This is useful, for example, when one wants to factor out unwanted differences between a set of classes. We show that ct-SNE fails in many realistic settings, namely if the data is well clustered over the labels in the original high-dimensional space. We introduce a revised method by conditioning the high-dimensional similarities instead of the low-dimensional similarities and storing within- and across-label nearest neighbors separately. This also enables the use of recently proposed speedups for t-SNE, improving the scalability. From experiments on synthetic data, we find that our proposed method resolves the considered problems and improves the embedding quality. On real data containing batch effects, the expected improvement is not always there. We argue revised ct-SNE is preferable overall, given its improved scalability. The results also highlight new open questions, such as how to handle distance variations between clusters.
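The idea of conditioning the high-dimensional similarities and keeping within- and across-label neighbors apart can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formula, so the reweighting scheme, the fixed Gaussian bandwidth (real t-SNE uses a perplexity search), the function names, and the parameters `same_weight` and `diff_weight` are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_affinities(X, sigma=1.0):
    """Normalized Gaussian similarities with one fixed bandwidth.

    Assumption: a fixed sigma replaces the per-point perplexity search
    that t-SNE normally performs.
    """
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    P = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def condition_on_labels(P, labels, same_weight=0.1, diff_weight=1.0):
    """Hypothetical conditioning step applied in the high-dimensional space.

    Same-label pairs are down-weighted relative to different-label pairs,
    then the matrix is renormalized to sum to one.
    """
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    P_cond = P * np.where(same, same_weight, diff_weight)
    np.fill_diagonal(P_cond, 0.0)
    return P_cond / P_cond.sum()

def split_neighbors(X, labels, k=3):
    """Brute-force k nearest neighbors, found separately within and across labels.

    Requires at least k + 1 points per label and at least k points
    carrying a different label.
    """
    labels = np.asarray(labels)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbor
    same = labels[:, None] == labels[None, :]
    within = np.argsort(np.where(same, d2, np.inf), axis=1)[:, :k]
    across = np.argsort(np.where(~same, d2, np.inf), axis=1)[:, :k]
    return within, across
```

With data that is well clustered over the labels, almost all similarity mass in `P` sits on same-label pairs. In this sketch, the reweighting shifts part of that mass to cross-label pairs before any embedding is computed, and the two neighbor lists ensure that cross-label neighbors are available even when the plain nearest neighbors all share a label.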
