Widely used pipelines for analyzing high-dimensional data utilize two-dimensional visualizations. These are created, for instance, via t-distributed stochastic neighbor embedding (t-SNE). A crucial element of the t-SNE embedding procedure is the perplexity hyperparameter. That is because the embedding structure varies when perplexity is changed. A suitable perplexity choice depends on the data set and the intended usage for the embedding. Therefore, perplexity is often chosen based on heuristics, intuition, and prior experience. This paper uncovers a linear relationship between perplexity and the data set size. Namely, we show that embeddings remain structurally consistent across data set samples when perplexity is adjusted accordingly. Qualitative and quantitative experimental results support these findings. This informs the visualization process, guiding the user when choosing a perplexity value. Finally, we outline several applications for the visualization of high-dimensional data via t-SNE based on this linear relationship.
翻译:暂无翻译