Multi-modal, high-throughput biological data presents both a great scientific opportunity and a significant computational challenge. In multi-modal measurements, every sample is observed simultaneously by two or more sets of sensors. In such settings, many of the observed variables in both modalities are nuisance variables that carry no information about the phenomenon of interest. Here, we propose a multi-modal unsupervised feature selection framework that identifies informative variables from coupled high-dimensional measurements. Our method is designed to identify features associated with two types of latent low-dimensional structures: (i) shared structures that govern the observations in both modalities and (ii) differential structures that appear in only one modality. To that end, we propose two Laplacian-based scoring operators. We combine the scores with differentiable gates that mask nuisance features and improve the accuracy of the structure captured by the graph Laplacian. We illustrate the performance of the new scheme on synthetic and real datasets, including an extended biological application to single-cell multi-omics.
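To make the idea of Laplacian-based feature scoring concrete, the following is a minimal single-modality sketch in Python with NumPy. It is not the scoring operators or the differentiable gates proposed in this work: it uses a generic smoothness score, f^T L f for a centered, unit-norm feature f and a normalized graph Laplacian L, and all names (`gaussian_affinity`, `smoothness_scores`), the Gaussian kernel, and the median bandwidth heuristic are assumptions made purely for illustration.

```python
import numpy as np

def gaussian_affinity(X, sigma=None):
    """Dense Gaussian-kernel affinity between the samples (rows) of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    if sigma is None:
        # Median heuristic for the bandwidth (an illustrative assumption).
        sigma = np.sqrt(np.median(d2[d2 > 0]))
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

def smoothness_scores(X, W):
    """Return f^T L f for every centered, unit-norm feature column f.

    L is the symmetric normalized graph Laplacian of W. A lower score means
    the feature varies more smoothly over the sample graph.
    """
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    F = X - X.mean(axis=0)
    F = F / np.linalg.norm(F, axis=0)
    return np.sum(F * (L @ F), axis=0)

# Toy data: two clusters; features 0-1 carry the cluster structure,
# features 2-9 are pure Gaussian noise (nuisance variables).
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
informative = (6.0 * labels[:, None] - 3.0) + rng.normal(size=(100, 2))
nuisance = rng.normal(size=(100, 8))
X = np.hstack([informative, nuisance])

scores = smoothness_scores(X, gaussian_affinity(X))
print("features ranked from smoothest to least smooth:", np.argsort(scores))
```

In this toy setting the graph is built from all features, nuisance included, which is the situation in which masking nuisance features before constructing the Laplacian would be expected to help; the sketch does not attempt that step, nor the shared versus differential analysis across two modalities.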