Joint Hierarchical Representation Learning of Samples and Features via Informed Tree-Wasserstein Distance

High-dimensional data often exhibit hierarchical structures in both modes: samples and features. Yet, most existing approaches for hierarchical representation learning consider only one mode at a time. In this work, we propose an unsupervised method for jointly learning hierarchical representations of samples and features via Tree-Wasserstein Distance (TWD). Our method alternates between the two data modes. It first constructs a tree for one mode, then computes a TWD for the other mode based on that tree, and finally uses the resulting TWD to build the second mode's tree. By repeatedly alternating through these steps, the method gradually refines both trees and the corresponding TWDs, capturing meaningful hierarchical representations of the data. We provide a theoretical analysis showing that our method converges. We show that our method can be integrated into hyperbolic graph convolutional networks as a pre-processing technique, improving performance in link prediction and node classification tasks. In addition, our method outperforms baselines in sparse approximation and unsupervised Wasserstein distance learning tasks on word-document and single-cell RNA-sequencing datasets.

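The alternating scheme described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how each tree is built or how the first distance is initialized, so the code assumes average-linkage clustering for tree construction, an L1 distance between normalized feature columns as the starting point, and a fixed number of iterations. The function names `build_tree`, `twd`, `normalize`, and `joint_trees` are hypothetical. The TWD itself uses the standard closed form on a tree, $\mathrm{TW}(\mu,\nu)=\sum_{e} w_e\,\lvert \mu(\Gamma_e)-\nu(\Gamma_e)\rvert$, where $\Gamma_e$ is the set of leaves below edge $e$ and $w_e$ is its length.

```python
from itertools import combinations


def build_tree(dist):
    """Average-linkage tree over n leaves from an n x n distance matrix.

    Stand-in for the paper's tree construction (an assumption).
    Returns one (leaf_set, edge_weight) pair per non-root node, where
    edge_weight is the length of the edge joining the node to its parent.
    """
    def avg(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    # Active clusters: id -> (leaves under the node, node height).
    clusters = {i: (frozenset([i]), 0.0) for i in range(len(dist))}
    edges, next_id = [], len(dist)
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda p: avg(clusters[p[0]][0], clusters[p[1]][0]))
        (la, ha), (lb, hb) = clusters.pop(a), clusters.pop(b)
        h = avg(la, lb) / 2  # node height: half the merge distance
        edges += [(la, max(h - ha, 0.0)), (lb, max(h - hb, 0.0))]
        clusters[next_id] = (la | lb, h)
        next_id += 1
    return edges


def twd(dists, tree):
    """Pairwise TWD between distributions supported on the tree's leaves."""
    # Mass each distribution places below every edge of the tree.
    mass = [[sum(p[i] for i in leaves) for leaves, _ in tree] for p in dists]
    w = [weight for _, weight in tree]
    return [[sum(wk * abs(x - y) for wk, x, y in zip(w, mu, nu))
             for nu in mass] for mu in mass]


def normalize(rows):
    """Scale each non-negative row so that it sums to one."""
    out = []
    for r in rows:
        s = sum(r)
        out.append([v / s for v in r])
    return out


def joint_trees(X, n_iter=3):
    """Alternate between the feature tree and the sample tree of X.

    X is a non-negative samples-by-features table (list of lists) with
    positive row and column sums.
    """
    samples = normalize(X)
    features = normalize([list(c) for c in zip(*X)])
    # Assumed initialization: L1 distance between feature columns.
    d_feat = [[sum(abs(x - y) for x, y in zip(f, g)) for g in features]
              for f in features]
    for _ in range(n_iter):
        feat_tree = build_tree(d_feat)     # tree for one mode
        d_samp = twd(samples, feat_tree)   # TWD for the other mode
        samp_tree = build_tree(d_samp)     # tree for the second mode
        d_feat = twd(features, samp_tree)  # TWD back on the first mode
    return d_samp, d_feat, samp_tree, feat_tree


if __name__ == "__main__":
    X = [[5, 4, 0, 1], [4, 5, 1, 0], [0, 1, 5, 4], [1, 0, 4, 5]]
    d_samp, d_feat, _, _ = joint_trees(X)
    for row in d_samp:
        print([round(v, 4) for v in row])
```

The brute-force clustering step is kept deliberately simple and is only practical for toy inputs. A fixed iteration count replaces any convergence check, since the abstract states that the method converges but gives no stopping criterion.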