This paper introduces the Gene Mover's Distance, a measure of similarity between a pair of cells based on their gene expression profiles obtained via single-cell RNA sequencing. The underlying idea is to interpret the gene expression array of a single cell as a discrete probability measure. The distance between two cells is then computed by solving an Optimal Transport problem between the two corresponding discrete measures. In the Optimal Transport model, we use two types of cost function to measure the distance between a pair of genes. The first cost function exploits a gene embedding, called gene2vec, which maps each gene to a high-dimensional vector: the cost of moving a unit of mass of gene expression from one gene to another is set to the Euclidean distance between the corresponding embedded vectors. The second cost function is based on a Pearson distance between pairs of genes. Under both cost functions, the more strongly two genes are correlated, the smaller the distance between them. We use the Gene Mover's Distance to solve two classification problems: the classification of cells according to their condition and according to their type. To assess the impact of our new metric, we compare the performance of a $k$-Nearest Neighbor classifier using different distances. The computational results show that the Gene Mover's Distance is competitive with the state-of-the-art distances used in the literature.
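To make the construction concrete, the following is a minimal sketch of the embedding-based variant, under several assumptions that are not taken from the paper: the function names (`to_measure`, `euclidean_cost`, `optimal_transport`, `gene_movers_distance`) are hypothetical, the three-gene embedding is a toy stand-in for gene2vec vectors, and the Optimal Transport problem is solved with a small successive-shortest-path routine that is only practical for a handful of genes. It is meant to illustrate the definition, not the solver used in our experiments.

```python
import math

def to_measure(expression):
    """Normalize a non-negative expression array so that it sums to 1."""
    total = float(sum(expression))
    if total <= 0:
        raise ValueError("cell has no expressed genes")
    return [x / total for x in expression]

def euclidean_cost(embedding):
    """Pairwise Euclidean distances between gene embedding vectors."""
    return [[math.dist(u, v) for v in embedding] for u in embedding]

def optimal_transport(a, b, cost, tol=1e-12):
    """Exact transport cost between measures a and b (equal total mass),
    computed by successive shortest paths on the bipartite residual graph."""
    n, m = len(a), len(b)
    supply, demand = list(a), list(b)
    flow = [[0.0] * m for _ in range(n)]
    total = 0.0
    while True:
        # Bellman-Ford from every source gene that still holds mass.
        dist = [0.0 if s > tol else math.inf for s in supply] + [math.inf] * m
        pred = [None] * (n + m)
        for _ in range(n + m):
            changed = False
            for i in range(n):
                for j in range(m):
                    # Forward edge: move mass from gene i to gene j.
                    if dist[i] + cost[i][j] < dist[n + j] - tol:
                        dist[n + j] = dist[i] + cost[i][j]
                        pred[n + j] = i
                        changed = True
                    # Backward edge: undo mass previously moved from i to j.
                    if flow[i][j] > tol and dist[n + j] - cost[i][j] < dist[i] - tol:
                        dist[i] = dist[n + j] - cost[i][j]
                        pred[i] = n + j
                        changed = True
            if not changed:
                break
        targets = [j for j in range(m)
                   if demand[j] > tol and dist[n + j] < math.inf]
        if not targets:
            break  # all mass has been shipped
        j = min(targets, key=lambda t: dist[n + t])
        # Recover the cheapest augmenting path by walking predecessors.
        path, node = [], n + j
        while pred[node] is not None:
            path.append((pred[node], node))
            node = pred[node]
        delta = min(supply[node], demand[j])
        for u, v in path:
            if u >= n:  # backward edge limits the augmentation
                delta = min(delta, flow[v][u - n])
        for u, v in path:
            if u < n:
                flow[u][v - n] += delta
                total += delta * cost[u][v - n]
            else:
                flow[v][u - n] -= delta
                total -= delta * cost[v][u - n]
        supply[node] -= delta
        demand[j] -= delta
    return total

def gene_movers_distance(cell_x, cell_y, cost):
    """Distance between two cells given as raw expression arrays."""
    return optimal_transport(to_measure(cell_x), to_measure(cell_y), cost)

# Toy data (hypothetical): three genes embedded in the plane, two cells.
embedding = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
cost = euclidean_cost(embedding)
cell_a = [4, 0, 0]   # expresses only gene 0
cell_b = [0, 2, 2]   # expresses genes 1 and 2 equally
print(gene_movers_distance(cell_a, cell_b, cost))
```

In this toy case all the mass of `cell_a` sits on gene 0, and half of it must move to gene 1 (distance 1) and half to gene 2 (distance 2). The same `gene_movers_distance` function could be called with a cost matrix built from a Pearson distance instead, and the resulting pairwise distances passed to a $k$-Nearest Neighbor classifier.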