We propose a method to facilitate the exploration and analysis of new, large data sets. In particular, we give an unsupervised deep learning approach to learning a latent representation that captures semantic similarity in the data set. The core idea is to use data augmentations that preserve semantic meaning to generate synthetic examples of elements whose feature representations should be close to one another. We demonstrate the utility of our method on nano-scale electron microscopy (EM) data, where even relatively small portions of animal brains can require terabytes of image data. Although supervised methods can be used to predict and identify known patterns of interest, the scale of the data makes it difficult to mine and analyze patterns that are not known a priori. We show that our learned representation enables query by example: if a scientist notices an interesting pattern in the data, they can be presented with other locations that exhibit matching patterns. We also demonstrate that clustering of data in the learned space correlates with biologically meaningful distinctions. Finally, we introduce a visualization tool and software ecosystem to facilitate user-friendly interactive analysis and uncover interesting biological patterns. In short, our work opens possible new avenues for understanding and discovery in large data sets arising in domains such as EM analysis.
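As a rough illustration of query by example in a learned feature space, the sketch below ranks stored feature vectors by cosine similarity to a chosen query. It is a minimal sketch under stated assumptions, not the method's actual implementation: the function name `query_by_example`, the choice of cosine similarity, and the toy embeddings (two synthetic clusters standing in for the output of a trained encoder) are all hypothetical.

```python
import numpy as np


def query_by_example(embeddings, query_index, k=5):
    """Return the indices and similarities of the k rows closest to the query row.

    `embeddings` is an (n, d) array of feature vectors, assumed here to come
    from some already-trained encoder (hypothetical; not specified in the text).
    """
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    # Similarity of every row to the query row.
    sims = unit @ unit[query_index]

    # Exclude the query itself from the results.
    sims[query_index] = -np.inf

    # Indices of the k most similar rows, most similar first.
    top = np.argsort(-sims)[:k]
    return top, sims[top]


# Toy stand-in for learned features: two well-separated synthetic clusters.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
embeddings = np.vstack([
    centers[0] + 0.05 * rng.standard_normal((10, 3)),  # rows 0..9
    centers[1] + 0.05 * rng.standard_normal((10, 3)),  # rows 10..19
])

# Query with row 0; its neighbors should come from the same cluster.
neighbors, scores = query_by_example(embeddings, query_index=0, k=5)
print(neighbors, scores)
```

At the terabyte scale described above, an exhaustive comparison like this would likely be replaced by an approximate nearest-neighbor index; the brute-force version is used here only to keep the idea visible.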