Learning meaningful representations of data that can address challenges such as batch effect correction and counterfactual inference is a central problem in many domains including computational biology. Adopting a Conditional VAE framework, we show that marginal independence between the representation and a condition variable plays a key role in both of these challenges. We propose the Contrastive Mixture of Posteriors (CoMP) method that uses a novel misalignment penalty defined in terms of mixtures of the variational posteriors to enforce this independence in latent space. We show that CoMP has attractive theoretical properties compared to previous approaches, and we prove counterfactual identifiability of CoMP under additional assumptions. We demonstrate state-of-the-art performance on a set of challenging tasks including aligning human tumour samples with cancer cell-lines, predicting transcriptome-level perturbation responses, and batch correction on single-cell RNA sequencing data. We also find parallels to fair representation learning and demonstrate that CoMP is competitive on a common task in the field.