Latent variable models such as the Variational Auto-Encoder (VAE) have become a go-to tool for analyzing biological data, especially in single-cell genomics. One remaining challenge is interpreting the latent variables as the biological processes that define a cell's identity. Outside of biological applications, this problem is commonly referred to as learning disentangled representations. Although several disentanglement-promoting variants of the VAE have been introduced and applied to single-cell genomics data, this task has been shown to be infeasible from independent and identically distributed measurements without additional structure. Instead, recent methods propose to leverage non-stationary data, together with the sparse mechanism shift assumption, to learn disentangled representations with causal semantics. Here, we extend the application of these methodological advances to the analysis of single-cell genomics data with genetic or chemical perturbations. More precisely, we propose a deep generative model of single-cell gene expression data in which each perturbation is treated as a stochastic intervention targeting an unknown, but sparse, subset of latent variables. We benchmark these methods on simulated single-cell data to evaluate their performance at latent unit recovery, causal target identification, and out-of-domain generalization. Finally, we apply these approaches to two real-world large-scale gene perturbation data sets and find that models that exploit the sparse mechanism shift hypothesis surpass contemporary methods on a transfer learning task. We implement our new model and benchmarks using the scvi-tools library and release them as open-source software at https://github.com/romain-lopez/svae.
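To make the sparse mechanism shift assumption concrete, the following toy simulation illustrates the idea of a perturbation acting as a stochastic intervention on a small subset of latent variables. It is only a sketch under assumptions that are not taken from the paper: the function names, the Gaussian shift, the log-linear decoder, and all parameter values are hypothetical, and it does not use the scvi-tools API or reproduce the released model. Count noise is omitted, so the decoder returns expected expression rates rather than counts.

```python
import math
import random


def sample_targets(n_perturbations, n_latents, n_targets, rng):
    """Pick, for each perturbation, a small (sparse) subset of latent indices."""
    return [
        sorted(rng.sample(range(n_latents), n_targets))
        for _ in range(n_perturbations)
    ]


def sample_latent(targets, shift_mean, n_latents, rng):
    """Sample one cell's latent vector.

    Every latent starts from a standard normal draw (the unperturbed
    mechanism). Only the targeted latents receive an additional random
    shift, which plays the role of a stochastic intervention.
    """
    z = [rng.gauss(0.0, 1.0) for _ in range(n_latents)]
    for k in targets:
        z[k] += rng.gauss(shift_mean, 1.0)  # hypothetical shift distribution
    return z


def decode(z, weights):
    """Toy log-linear decoder: map a latent vector to positive per-gene rates."""
    return [
        math.exp(sum(w * zk for w, zk in zip(row, z)))
        for row in weights
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    n_latents, n_genes = 10, 5

    # Each of 3 hypothetical perturbations targets 2 of the 10 latents.
    targets = sample_targets(3, n_latents, 2, rng)

    # Random decoder weights, shared by control and perturbed cells.
    weights = [
        [rng.gauss(0.0, 0.3) for _ in range(n_latents)]
        for _ in range(n_genes)
    ]

    control = sample_latent([], 3.0, n_latents, rng)
    perturbed = sample_latent(targets[0], 3.0, n_latents, rng)
    print("targets of perturbation 0:", targets[0])
    print("control rates:  ", decode(control, weights))
    print("perturbed rates:", decode(perturbed, weights))
```

In this sketch the decoder is shared across all conditions, so any difference between control and perturbed cells arises only through the few shifted latent dimensions. Inferring which dimensions those are, from expression data alone, corresponds to the causal target identification task mentioned above.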