Training generative models that capture rich semantics of the data and interpreting the latent representations encoded by such models are important problems in unsupervised learning. In this work, we provide a simple algorithm that relies on perturbation experiments on the latent codes of a pre-trained generative autoencoder to uncover the causal graph implied by the generative model. We leverage pre-trained attribute classifiers and perform perturbation experiments to check the influence of a given latent variable on a subset of attributes. Given this, we show that one can fit an effective causal graph that models a structural equation model between latent codes, taken as exogenous variables, and attributes, taken as observed variables. One interesting aspect is that a single latent variable controls multiple overlapping subsets of attributes, unlike conventional approaches that try to impose full independence. Using an RNN-based generative autoencoder pre-trained on a dataset of peptide sequences, we demonstrate that the causal graph learnt by our algorithm between attributes and latent codes can be used to predict a specific property for unseen sequences. We compare prediction models trained on either all available attributes or only those in the Markov blanket, and empirically show that, in both the unsupervised and supervised regimes, the predictor that relies on Markov blanket attributes typically generalizes better for out-of-distribution sequences.
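To make the perturbation-based influence test concrete, here is a minimal sketch under stated assumptions: the decoder and attribute classifiers are taken as opaque callables (here replaced by toy numpy stand-ins), and names such as `influenced_attributes`, `delta`, and `threshold` are illustrative choices, not identifiers from the paper.

```python
import numpy as np

def influenced_attributes(decode, classifiers, z, delta=1.0, threshold=0.1):
    """Perturb each latent dimension of z and record which attribute
    classifiers change their output noticeably (a crude influence test)."""
    base = np.array([clf(decode(z)) for clf in classifiers])
    influence = {}
    for j in range(len(z)):
        z_pert = z.copy()
        z_pert[j] += delta                      # intervene on latent code j
        perturbed = np.array([clf(decode(z_pert)) for clf in classifiers])
        # attributes whose classifier output moved more than the threshold
        influence[j] = [a for a, (b, p) in enumerate(zip(base, perturbed))
                        if abs(p - b) > threshold]
    return influence

# Toy stand-ins: a linear "decoder" and two simple "attribute classifiers".
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))                     # latent dim 4 -> embedding dim 8
decode = lambda z: z @ W
classifiers = [lambda x: float(x[:3].sum() > 0),  # binary attribute
               lambda x: float(x[3:].mean())]     # continuous attribute

z0 = rng.normal(size=4)
print(influenced_attributes(decode, classifiers, z0))
```

In the setting described above, the resulting map from latent dimensions to influenced attribute subsets is what one would then use to fit the structural equation model, with latent codes as exogenous variables and attributes as observed variables.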