Scientific datasets often have hierarchical structure: for example, in surveys, individual participants (samples) might be grouped at a higher level (units), such as their geographical region. In these settings, the interest is often in exploring the structure on the unit level rather than on the sample level. Units can be compared based on the distance between their means; however, this ignores the within-unit distribution of samples. Here we develop an approach for exploratory analysis of hierarchical datasets using the Wasserstein distance metric, which takes into account the shapes of within-unit distributions. We use t-SNE to construct 2D embeddings of the units, based on the matrix of pairwise Wasserstein distances between them. The distance matrix can be efficiently computed by approximating each unit with a Gaussian distribution, but we also provide a scalable method to compute exact Wasserstein distances. We use synthetic data to demonstrate the effectiveness of our Wasserstein t-SNE, and apply it to data from the 2017 German parliamentary election, considering polling stations as samples and voting districts as units. The resulting embedding uncovers meaningful structure in the data.
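The Gaussian approximation mentioned above can be illustrated with a minimal sketch. It assumes each unit is summarized by its sample mean and covariance and uses the standard closed form for the 2-Wasserstein distance between two Gaussians, $W_2^2 = \lVert m_1 - m_2 \rVert^2 + \mathrm{Tr}\big(C_1 + C_2 - 2\,(C_1^{1/2} C_2 C_1^{1/2})^{1/2}\big)$. The function names (`psd_sqrt`, `gaussian_w2`, `unit_distance_matrix`) and the toy data are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
import numpy as np

def psd_sqrt(a):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)  # clip tiny negative eigenvalues from round-off
    return (vecs * np.sqrt(vals)) @ vecs.T

def gaussian_w2(m1, c1, m2, c2):
    """2-Wasserstein distance between N(m1, c1) and N(m2, c2)."""
    r = psd_sqrt(c1)
    cross = psd_sqrt(r @ c2 @ r)
    sq = np.sum((m1 - m2) ** 2) + np.trace(c1 + c2 - 2.0 * cross)
    return np.sqrt(max(sq, 0.0))  # guard against slightly negative round-off

def unit_distance_matrix(units):
    """Pairwise Gaussian W2 distances; `units` is a list of (n_samples, n_features) arrays."""
    stats = [(u.mean(axis=0), np.atleast_2d(np.cov(u, rowvar=False))) for u in units]
    n = len(units)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = gaussian_w2(*stats[i], *stats[j])
    return d

rng = np.random.default_rng(0)
# Three toy units: the first two share a mean but differ in spread;
# the third has the same spread as the first but a shifted mean.
units = [
    rng.normal(0.0, 1.0, size=(500, 2)),
    rng.normal(0.0, 3.0, size=(500, 2)),
    rng.normal(5.0, 1.0, size=(500, 2)),
]
D = unit_distance_matrix(units)
```

In this toy setup, the first two units have nearly identical means, so a comparison based on means alone would treat them as almost the same, whereas the Wasserstein distance between them is clearly non-zero. A matrix such as `D` could then be passed to any t-SNE implementation that accepts precomputed distances, for example scikit-learn's `TSNE(metric="precomputed", init="random")`.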