Modern single-cell flow and mass cytometry technologies measure the expression of several proteins in the individual cells of a blood or tissue sample. Each profiled biological sample is thus represented by a set of hundreds of thousands of multidimensional cell feature vectors, which makes predicting each sample's associated phenotype with machine learning models computationally expensive. Such a large set cardinality also limits the interpretability of machine learning models, because it is difficult to track how each individual cell influences the ultimate prediction. We propose using Kernel Mean Embedding to encode the cellular landscape of each profiled biological sample. Although our foremost goal is to build a more transparent model, we find that our method achieves accuracy comparable to or better than state-of-the-art gating-free methods using a simple linear classifier. As a result, our model contains few parameters but still performs similarly to deep learning models with millions of parameters. In contrast with deep learning approaches, the linearity and sub-selection step of our model make it easy to interpret classification results. Further analysis shows that our method admits rich biological interpretability for linking cellular heterogeneity to clinical phenotype.
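To make the idea in the abstract concrete, the following is a minimal sketch of how a kernel mean embedding can turn a variable-size set of cells into one fixed-length vector, and why a linear classifier on top of it stays traceable to individual cells. The abstract does not say which kernel or approximation is used, so the Gaussian kernel, the random Fourier feature approximation, the bandwidth, the feature count, the synthetic data, and the classifier weights below are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def random_fourier_features(cells, weights, offsets):
    """Map each cell (row) to an approximate Gaussian-kernel feature vector."""
    n_features = weights.shape[1]
    return np.sqrt(2.0 / n_features) * np.cos(cells @ weights + offsets)

def kernel_mean_embedding(cells, weights, offsets):
    """Average the per-cell feature maps into one fixed-length sample vector."""
    return random_fourier_features(cells, weights, offsets).mean(axis=0)

rng = np.random.default_rng(0)
n_markers, n_features, bandwidth = 5, 256, 1.0  # hypothetical settings

# Shared random projection: every sample must use the same weights and offsets.
weights = rng.normal(scale=1.0 / bandwidth, size=(n_markers, n_features))
offsets = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

# Synthetic stand-ins for two profiled samples with different cell counts.
sample_a = rng.normal(loc=0.0, size=(2000, n_markers))
sample_b = rng.normal(loc=1.0, size=(3500, n_markers))

embedding_a = kernel_mean_embedding(sample_a, weights, offsets)
embedding_b = kernel_mean_embedding(sample_b, weights, offsets)

# Hypothetical linear classifier weights (in practice these would be fitted
# to labelled samples). Because the embedding is a mean over cells, the sample
# score is exactly the mean of per-cell scores.
linear_weights = rng.normal(size=n_features)
sample_score = embedding_a @ linear_weights
cell_scores = random_fourier_features(sample_a, weights, offsets) @ linear_weights
```

Under these assumptions, both samples map to vectors of the same length regardless of how many cells they contain, and `cell_scores` gives each cell's contribution to `sample_score`, which is one way the linearity mentioned in the abstract can support interpretation.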