Single-cell RNA sequencing (scRNA-seq) enables high-resolution analysis of cellular heterogeneity, but its complexity, marked by high dimensionality, sparsity, and batch effects, poses major computational challenges. Transformer-based models have made significant advances in this domain but are often limited by the quadratic complexity of self-attention and by suboptimal handling of long-range dependencies. In this work, we introduce GeneMamba, a scalable and efficient foundation model for single-cell transcriptomics built on state space modeling. Leveraging the Bi-Mamba architecture, GeneMamba captures bidirectional gene context with linear-time complexity, offering substantial computational gains over transformer baselines. The model is pretrained on nearly 30 million cells and incorporates biologically informed design choices, including a pathway-aware contrastive loss and rank-based gene encoding. We evaluate GeneMamba across diverse tasks, including multi-batch integration, cell type annotation, and gene-gene correlation, demonstrating strong performance, interpretability, and robustness. These results position GeneMamba as a practical and powerful alternative to transformer-based methods, advancing the development of biologically grounded, scalable tools for large-scale single-cell data analysis.
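The abstract does not spell out how rank-based gene encoding is computed. As a minimal sketch of the general idea, assuming (this is not taken from GeneMamba itself) that each cell is turned into a sequence of gene identifiers ordered from highest to lowest expression, with unexpressed genes dropped, the encoding could look like the following. The function name `rank_encode`, the `max_len` parameter, and the gene names are hypothetical and used only for illustration.

```python
def rank_encode(expression, gene_ids, max_len=None):
    """Illustrative rank-based encoding of one cell (assumed scheme, not GeneMamba's code).

    expression: per-gene expression values for a single cell
    gene_ids:   gene identifiers aligned with `expression`
    max_len:    optional cap on the sequence length (hypothetical parameter)
    """
    # Keep only expressed genes, remembering each gene's position.
    expressed = [(value, idx) for idx, value in enumerate(expression) if value > 0]
    # Sort by descending expression; Python's sort is stable, so ties
    # keep their original gene order.
    expressed.sort(key=lambda pair: -pair[0])
    tokens = [gene_ids[idx] for _, idx in expressed]
    return tokens if max_len is None else tokens[:max_len]


# Toy cell with five hypothetical genes.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
cell = [0.0, 5.0, 0.0, 2.0, 9.0]
tokens = rank_encode(cell, genes)
print(tokens)  # ['GENE_E', 'GENE_B', 'GENE_D']
```

An encoding of this kind depends only on the relative ordering of genes within a cell, not on absolute expression values, which is one reason rank-based inputs are often considered less sensitive to differences in scale between batches.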