Modern single-cell flow and mass cytometry technologies measure the expression of several proteins in the individual cells of a blood or tissue sample. Each profiled biological sample is thus represented by a set of hundreds of thousands of multidimensional cell feature vectors, which makes it computationally expensive to predict each sample's associated phenotype with machine learning models. Such a large set cardinality also limits the interpretability of machine learning models, because it is difficult to track how each individual cell influences the final prediction. Using Kernel Mean Embedding to encode the cellular landscape of each profiled biological sample, we can train a simple linear classifier and achieve state-of-the-art classification accuracy on three flow and mass cytometry datasets. Our model contains few parameters but still performs similarly to deep learning models with millions of parameters. In contrast with deep learning approaches, the linearity and sub-selection step of our model make classification results easy to interpret. Clustering analysis further shows that our method admits rich biological interpretability for linking cellular heterogeneity to clinical phenotype.
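To make the idea of encoding a sample by its kernel mean embedding and then fitting a linear classifier concrete, the following is a minimal sketch, not the authors' implementation. It assumes an RBF kernel approximated with random Fourier features (one common way to obtain an explicit, finite-dimensional embedding), uses synthetic toy "samples" in place of real cytometry data, and stands in plain logistic regression for the linear classifier. All function names, the kernel bandwidth `gamma`, the number of features, and the simulated shift are hypothetical choices made for illustration; the sub-selection step mentioned in the abstract is not shown.

```python
import math
import random


def draw_fourier_params(n_markers, n_features, gamma, rng):
    """Random Fourier parameters approximating k(x, y) = exp(-gamma * ||x - y||^2)."""
    scale = math.sqrt(2.0 * gamma)  # frequencies are drawn from N(0, 2 * gamma)
    freqs = [[rng.gauss(0.0, scale) for _ in range(n_markers)]
             for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    return freqs, phases


def mean_embedding(cells, freqs, phases):
    """Average the random feature map over every cell of one sample."""
    n_features = len(freqs)
    norm = math.sqrt(2.0 / n_features)
    totals = [0.0] * n_features
    for cell in cells:
        for j in range(n_features):
            proj = sum(f * x for f, x in zip(freqs[j], cell)) + phases[j]
            totals[j] += math.cos(proj)
    return [norm * t / len(cells) for t in totals]


def simulate_sample(label, n_cells, n_markers, rng):
    """Toy data (assumption): label 1 shifts about half of the cells in every marker."""
    cells = []
    for _ in range(n_cells):
        shift = 2.5 if (label == 1 and rng.random() < 0.5) else 0.0
        cells.append([rng.gauss(shift, 1.0) for _ in range(n_markers)])
    return cells


def train_logistic(X, y, lr=1.0, epochs=300):
    """Logistic regression fitted by full-batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-score)) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the linear score is positive, else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0.0 else 0


rng = random.Random(0)
N_MARKERS, N_FEATURES, N_CELLS = 3, 64, 300
freqs, phases = draw_fourier_params(N_MARKERS, N_FEATURES, gamma=0.5, rng=rng)


def make_dataset(n_per_class):
    """Embed n_per_class simulated samples for each of the two labels."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            cells = simulate_sample(label, N_CELLS, N_MARKERS, rng)
            X.append(mean_embedding(cells, freqs, phases))
            y.append(label)
    return X, y


train_X, train_y = make_dataset(10)
test_X, test_y = make_dataset(10)
w, b = train_logistic(train_X, train_y)
accuracy = sum(predict(w, b, x) == t for x, t in zip(test_X, test_y)) / len(test_y)
print(f"held-out accuracy on toy samples: {accuracy:.2f}")
```

In this sketch each sample, however many cells it contains, collapses to one fixed-length vector, so the classifier has only `N_FEATURES + 1` parameters. Because the embedding is an average over cells and the classifier is linear, the score of a sample is the average of per-cell scores, which is one way to see why individual cell contributions remain traceable in this kind of model.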