Multiscale methods for signal selection in single-cell data

From arXiv: 32 pages, 15 figures, 1 table. Revised and published in Entropy, special issue "Applications of Topological Data Analysis in the Life Sciences".

Analysis of single-cell transcriptomics often relies on clustering cells and then performing differential gene expression (DGE) to identify genes that vary between these clusters. These discrete analyses successfully determine cell types and markers; however, continuous variation within and between cell types may not be detected. We propose three topologically motivated mathematical methods for unsupervised feature selection that consider discrete and continuous transcriptional patterns on an equal footing across multiple scales simultaneously. Eigenscores ($\text{eig}_i$) rank signals or genes based on their correspondence to low-frequency intrinsic patterning in the data using the spectral decomposition of the graph Laplacian. The multiscale Laplacian score (MLS) is an unsupervised method for locating relevant scales in data and selecting the genes that are coherently expressed at these respective scales. The persistent Rayleigh quotient (PRQ) takes data equipped with a filtration, allowing the separation of genes with different roles in a bifurcation process (e.g., pseudo-time). We demonstrate the utility of these techniques by applying them to published single-cell transcriptomics data sets. The methods validate previously identified genes and detect additional biologically meaningful genes with coherent expression patterns. By studying the interaction between gene signals and the geometry of the underlying space, the three methods give multidimensional rankings of the genes and visualisation of relationships between them.
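To make the idea of ranking genes by their agreement with low-frequency patterning more concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not give the exact definitions of the eigenscores, MLS, or PRQ, so this only illustrates the two underlying ingredients, namely the Rayleigh quotient of a signal with respect to a graph Laplacian and the projection of a signal onto the lowest nontrivial Laplacian eigenvectors. The k-nearest-neighbour graph, the toy data, and all function names (`knn_graph_laplacian`, `rayleigh_quotient`, `low_frequency_energy`) are assumptions made for illustration.

```python
import numpy as np


def knn_graph_laplacian(X, k=4):
    """Unnormalised Laplacian L = D - W of a symmetrised kNN graph (illustrative choice)."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between cells.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # exclude self-neighbours
    nbrs = np.argsort(d2, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    W = np.maximum(W, W.T)  # make the adjacency symmetric
    return np.diag(W.sum(axis=1)) - W


def rayleigh_quotient(L, f):
    """f^T L f / f^T f for the mean-centred signal; small values mean a smooth signal."""
    f = f - f.mean()
    return float(f @ L @ f) / float(f @ f)


def low_frequency_energy(L, f, n_modes=3):
    """Fraction of the centred signal's energy in the lowest nontrivial eigenvectors."""
    _, vecs = np.linalg.eigh(L)  # eigenvalues returned in ascending order
    f = f - f.mean()
    f = f / np.linalg.norm(f)
    coeffs = vecs[:, 1:n_modes + 1].T @ f  # skip the constant eigenvector
    return float((coeffs ** 2).sum())


# Toy data (hypothetical): 60 "cells" along a curved one-dimensional trajectory.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 60)
X = np.column_stack([t, t ** 2])
L = knn_graph_laplacian(X, k=4)

smooth_gene = t.copy()                # varies coherently along the trajectory
noisy_gene = rng.standard_normal(60)  # no structure relative to the geometry

rq_smooth = rayleigh_quotient(L, smooth_gene)
rq_noisy = rayleigh_quotient(L, noisy_gene)
energy_smooth = low_frequency_energy(L, smooth_gene)
energy_noisy = low_frequency_energy(L, noisy_gene)

print("Rayleigh quotient (smooth, noisy):", rq_smooth, rq_noisy)
print("Low-frequency energy (smooth, noisy):", energy_smooth, energy_noisy)
```

In this toy setting the gene that follows the trajectory gets a much lower Rayleigh quotient and a much higher low-frequency energy than the random one, which is the kind of contrast a Laplacian-based ranking exploits. The multiscale and persistent variants described in the abstract would additionally vary the scale of the graph or use a filtration, which this sketch does not attempt.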
