Conditional t-SNE (ct-SNE) is a recent extension of t-SNE that removes known cluster (label) information from the embedding, yielding a visualization that reveals structure beyond the labels. This is useful, for example, when one wants to factor out unwanted differences between a set of classes. We show that ct-SNE fails in many realistic settings, namely when the data is well clustered by label in the original high-dimensional space. We introduce a revised method that conditions the high-dimensional similarities instead of the low-dimensional similarities and stores within- and across-label nearest neighbors separately. This also enables the use of recently proposed speedups for t-SNE, improving scalability. In experiments on synthetic data, we find that our proposed method resolves the considered problems and improves the embedding quality. On real data containing batch effects, the expected improvement does not always materialize. We argue that revised ct-SNE is nonetheless preferable overall, given its improved scalability. The results also highlight new open questions, such as how to handle distance variations between clusters.
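To make the idea of storing within- and across-label nearest neighbors separately concrete, the following is a minimal sketch and not the implementation used in the paper. The function name `split_nearest_neighbors`, the brute-force distance computation, and the single neighbor count `k` are assumptions made for illustration. How the two neighbor sets are weighted when conditioning the high-dimensional similarities is not stated in this abstract, so that step is deliberately left out.

```python
import numpy as np

def split_nearest_neighbors(X, labels, k):
    """Hypothetical helper: for each point, return the indices of its k nearest
    same-label neighbors and its k nearest different-label neighbors.
    Brute force, O(n^2) memory; for illustration only."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]

    # Pairwise squared Euclidean distances.
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)

    # same[i, j] is True when points i and j share a label.
    same = labels[:, None] == labels[None, :]

    within, across = [], []
    for i in range(n):
        not_self = np.arange(n) != i
        w = np.flatnonzero(same[i] & not_self)   # same-label candidates
        a = np.flatnonzero(~same[i])             # different-label candidates
        within.append(w[np.argsort(d2[i, w], kind="stable")[:k]])
        across.append(a[np.argsort(d2[i, a], kind="stable")[:k]])
    return within, across
```

Keeping the two lists apart means a point always retains some neighbors from other labels, even when the data is so well clustered by label that a single nearest-neighbor list would contain only same-label points. A scalable version would presumably replace the brute-force search with an approximate nearest-neighbor index built per label, but that is an assumption rather than something the abstract describes.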