In modern relational machine learning it is common to encounter large graphs that arise via interactions or similarities between observations in many domains. Further, in many cases the target entities for analysis are actually signals on such graphs. We propose to compare and organize such datasets of graph signals by using an earth mover's distance (EMD) with a geodesic cost over the underlying graph. Typically, EMD is computed by optimizing over the cost of transporting one probability distribution to another over an underlying metric space. However, this is inefficient when computing the EMD between many signals. Here, we propose an unbalanced graph earth mover's distance that efficiently embeds the unbalanced EMD on an underlying graph into an $L^1$ space, whose metric we call unbalanced diffusion earth mover's distance (UDEMD). This leads us to an efficient nearest neighbors kernel over many signals defined on a large graph. Next, we show how this gives distances between graph signals that are robust to noise. Finally, we apply this to organizing patients based on clinical notes who are modelled as signals on the SNOMED-CT medical knowledge graph, embedding lymphoblast cells modeled as signals on a gene graph, and organizing genes modeled as signals over a large peripheral blood mononuclear (PBMC) cell graph. In each case, we show that UDEMD-based embeddings find accurate distances that are highly efficient compared to other methods.
翻译:在现代关系机器学习中,通常会遇到通过许多领域观测的相互作用或相似性产生的大图表。此外,在许多情况下,用于分析的目标实体实际上是这些图表上的信号。我们提议使用一个地球移动者距离(EMD)来比较和组织图形信号的数据集,其成本高于底图的大地移动者距离(EMD)的大地移动者距离。通常,EMD是通过优化将一个概率分布传送到另一个概率分布到一个基本测量空间的成本来计算的。然而,在计算许多信号之间计算 EMD时,这是效率低下的。在这里,我们提议一个不平衡的图形地球移动者距离,将不平衡的EMD有效地嵌入一个基图上的信号。我们建议用一个基数来比较和结构,将它嵌入一个基数的分布不均匀的地球移动器距离(UDEDDDDD)。这让我们找到一个高效的图形信号,在SNOMED-CT医学知识图中,将精确的EMDM的距离有效地嵌入一个基数,将一个基数比基数的基数的基数模型,将一个基数的基数的基数模型显示一个高基数的基数。