Index $t$-SNE: Tracking Dynamics of High-Dimensional Datasets with Coherent Embeddings

From arXiv. Presented at the International Conference on Big Data Visual Analytics (ICBDVA), Venice, Italy, August 12-13, 2021 (best paper award): https://publications.waset.org/pdf/10012177

$t$-SNE is an embedding method that the data science community has widely adopted. Two interesting characteristics of $t$-SNE are its structure preservation property and its answer to the crowding problem, in which not all neighbors in the high-dimensional space can be represented correctly in the low-dimensional space. $t$-SNE preserves the local neighborhood, and similar items are nicely spaced by adjusting to the local density. These two characteristics produce a meaningful representation, where the area of a cluster is proportional to the number of items it contains, and relationships between clusters are materialized by closeness in the embedding. The algorithm is non-parametric, so two initializations of the algorithm lead to two different embeddings. In a forensic approach, analysts would like to compare two or more datasets using their embeddings. One approach would be to learn a parametric model over an embedding built with a subset of the data. While this approach is highly scalable, points could be mapped to exactly the same position, making them indistinguishable. This type of model would also be unable to adapt to new outliers or to concept drift. This paper presents a methodology to reuse an existing embedding to create a new one in which cluster positions are preserved. The optimization process minimizes two costs, one relating to the shape of the embedding and the other to its match with the support embedding. The proposed algorithm has the same complexity as the original $t$-SNE for embedding new items, and a lower one when considering the embedding of a dataset sliced into sub-pieces. The method showed promising results on a real-world dataset, making it possible to observe the birth, evolution and death of clusters. The proposed approach facilitates identifying significant trends and changes, which supports monitoring the dynamics of high-dimensional datasets.
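
The two-cost objective can be illustrated with a small sketch. The abstract does not give the exact form of either cost, so everything below is an assumption made for illustration: the shape cost is taken to be the standard $t$-SNE Kullback-Leibler divergence, the support cost is taken to be a mean squared distance to target positions in the support embedding, and the names `gaussian_affinities`, `shape_cost`, `support_cost`, `combined_cost`, `anchors` and `lam` are hypothetical. The affinities use a single fixed bandwidth instead of the perplexity calibration of real $t$-SNE.

```python
import math


def gaussian_affinities(X, sigma=1.0):
    """Symmetric joint affinities p_ij with one fixed bandwidth.

    Simplification: real t-SNE calibrates one bandwidth per point
    from the perplexity.
    """
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
                W[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    total = sum(map(sum, W))
    return [[w / total for w in row] for row in W]


def shape_cost(P, Y):
    """KL(P || Q), with Q built from the Student-t kernel in the embedding."""
    n = len(Y)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                K[i][j] = 1.0 / (1.0 + d2)
    z = sum(map(sum, K))  # normalizer, so that q_ij = K[i][j] / z
    cost = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and P[i][j] > 0.0:
                cost += P[i][j] * math.log(P[i][j] * z / K[i][j])
    return cost


def support_cost(Y, anchors):
    """Assumed penalty: mean squared distance to support-embedding targets.

    `anchors` maps a point index to its target 2-D position.
    """
    if not anchors:
        return 0.0
    total = 0.0
    for i, target in anchors.items():
        total += sum((a - b) ** 2 for a, b in zip(Y[i], target))
    return total / len(anchors)


def combined_cost(P, Y, anchors, lam=1.0):
    """Weighted sum of the two costs; `lam` is a hypothetical trade-off weight."""
    return shape_cost(P, Y) + lam * support_cost(Y, anchors)


# Two tight pairs of points in 3-D, far from each other.
X = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 5.0, 5.0], [5.1, 5.0, 5.0]]
P = gaussian_affinities(X, sigma=1.0)

# Targets taken from a (hypothetical) support embedding for points 0 and 2.
anchors = {0: (-1.0, 0.0), 2: (1.0, 0.0)}

aligned = [(-1.0, 0.0), (-0.9, 0.0), (1.0, 0.0), (1.1, 0.0)]
mirrored = [(-x, y) for x, y in aligned]  # same shape, clusters swapped
```

The shape cost depends only on pairwise distances, so `aligned` and `mirrored` score the same on it. This is why two independent runs can place the same clusters in different positions. Under the assumed penalty, only the support term separates the two layouts, so minimizing `combined_cost` favors the layout whose clusters sit where the support embedding put them.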
