Analysis of single-cell transcriptomics data often relies on clustering cells and then performing differential gene expression (DGE) analysis to identify genes that vary between these clusters. These discrete analyses successfully determine cell types and markers; however, they may fail to detect continuous variation within and between cell types. We propose three topologically motivated mathematical methods for unsupervised feature selection that consider discrete and continuous transcriptional patterns on an equal footing across multiple scales simultaneously. Eigenscores ($\mathrm{eig}_i$) rank signals or genes by their correspondence to low-frequency intrinsic patterning in the data, using the spectral decomposition of the graph Laplacian. The multiscale Laplacian score (MLS) is an unsupervised method for locating relevant scales in data and selecting the genes that are coherently expressed at each of these scales. The persistent Rayleigh quotient (PRQ) takes data equipped with a filtration, allowing separation of genes with different roles in a bifurcation process (e.g. pseudo-time). We demonstrate the utility of these techniques by applying them to published single-cell transcriptomics data sets. The methods validate previously identified genes and detect additional genes with coherent expression patterns. By studying the interaction between gene signals and the geometry of the underlying space, the three methods give multidimensional rankings of the genes and a visualisation of the relationships between them.
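To make the idea of scoring a gene against the geometry of the data concrete, the following is a minimal sketch of the classical building block that the three methods share: the Rayleigh quotient of a gene signal with respect to a graph Laplacian. It is not the eigenscore, MLS, or PRQ as defined in the paper. The k-nearest-neighbour graph construction, the plain mean-centring, the function names `knn_graph_laplacian` and `rayleigh_quotient`, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def knn_graph_laplacian(X, k=5):
    """Unnormalised Laplacian L = D - A of a symmetrised k-NN graph (assumed construction)."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between cells.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a cell is not its own neighbour
    nbrs = np.argsort(d2, axis=1)[:, :k]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    A = np.maximum(A, A.T)  # make the adjacency matrix symmetric
    return np.diag(A.sum(axis=1)) - A


def rayleigh_quotient(L, f):
    """f^T L f / f^T f for the mean-centred signal f; small values mean f is smooth on the graph."""
    f = f - f.mean()
    return float(f @ L @ f) / float(f @ f)


# Toy data (hypothetical): 200 cells along a curved one-dimensional trajectory.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
X = np.column_stack([t, t ** 2])

L = knn_graph_laplacian(X, k=5)
smooth_gene = t                     # expression varies continuously along the trajectory
noisy_gene = rng.normal(size=200)   # expression unrelated to the geometry

rq_smooth = rayleigh_quotient(L, smooth_gene)
rq_noise = rayleigh_quotient(L, noisy_gene)
print(rq_smooth, rq_noise)
```

In this toy setting the gene that varies continuously along the trajectory receives a much smaller quotient than the noise gene, which is the sense in which such scores favour genes with coherent, low-frequency expression patterns. A constant gene would give a zero denominator after centring and would need to be filtered out first.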