Dimensionality reduction and visualization of high-dimensional datasets has long been a challenging problem. Modern high-throughput technologies produce new high-dimensional datasets with multiple views and relatively new data types. Visualizing these datasets requires a methodology that can uncover hidden patterns in the data without distorting its local and global structures. Very few existing methodologies, however, can realise this task. In this work, we introduce a novel unsupervised deep neural network model, called NeuroDAVIS, for data visualization. NeuroDAVIS is capable of extracting important features from the data, without assuming any data distribution, and visualizing the data effectively in lower dimension. It has been shown theoretically that the neighbourhood relationships of the data in high dimension are preserved in lower dimension. The performance of NeuroDAVIS has been evaluated on a wide variety of synthetic and real high-dimensional datasets, including numeric, textual, image and biological data. NeuroDAVIS has been highly competitive against both t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) with respect to visualization quality and preservation of data size, shape, and both local and global structure. It has also outperformed Fast interpolation-based t-SNE (Fit-SNE), a variant of t-SNE, on most of the high-dimensional datasets. For the biological datasets, besides t-SNE, UMAP and Fit-SNE, NeuroDAVIS has also performed well compared to other state-of-the-art algorithms, such as Potential of Heat-diffusion for Affinity-based Trajectory Embedding (PHATE) and the siamese neural network-based method IVIS. Downstream classification and clustering analyses have also yielded favourable results for NeuroDAVIS-generated embeddings.