Gene regulation is a dynamic process that connects genotype and phenotype. Given the difficulty of physically mapping mammalian gene circuitry, we require new computational methods to learn regulatory rules. Natural language is a valuable analogy to the communication of regulatory control. Machine learning systems model natural language by explicitly learning context dependencies between words. We propose a similar system applied to single-cell RNA expression profiles to learn context dependencies between genes. Our model, Exceiver, is trained across diverse cell types using a self-supervised task formulated for discrete count data, accounting for feature sparsity. We found agreement between the similarity profiles of latent sample representations and learned gene embeddings with respect to biological annotations. We evaluated Exceiver on a new dataset and a downstream prediction task and found that pretraining supports transfer learning. Our work provides a framework to model gene regulation at the single-cell level and transfer knowledge to downstream tasks.