Representations from large pretrained models such as BERT encode a range of features into monolithic vectors, affording strong predictive accuracy across a multitude of downstream tasks. In this paper we explore whether it is possible to learn disentangled representations by identifying existing subnetworks within pretrained models that encode distinct, complementary aspect representations. Concretely, we learn binary masks over transformer weights or hidden units to uncover subsets of features that correlate with a specific factor of variation; this eliminates the need to train a disentangled model from scratch for a particular task. We evaluate this method with respect to its ability to disentangle representations of sentiment from genre in movie reviews, "toxicity" from dialect in Tweets, and syntax from semantics. By combining masking with magnitude pruning we find that we can identify sparse subnetworks within BERT that strongly encode particular aspects (e.g., toxicity) while only weakly encoding others (e.g., race). Moreover, despite only learning masks, we find that disentanglement-via-masking performs as well as -- and often better than -- previously proposed methods based on variational autoencoders and adversarial training.
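Because the abstract only describes the masking idea at a high level, the following is a minimal PyTorch sketch, not code from the paper, of one way the core mechanism could be realised: a straight-through binary mask learned over the hidden units of a frozen BERT encoder, trained jointly with a linear probe for a single aspect (e.g., sentiment), plus a small sparsity penalty standing in loosely for the combination with pruning. The class name MaskedBertAspectProbe, the straight-through parameterisation, and all hyperparameters are illustrative assumptions.

```python
# Illustrative sketch only: learn a binary mask over frozen BERT hidden units
# so that the masked [CLS] representation predicts one aspect label.
import torch
import torch.nn as nn
from transformers import AutoModel


class MaskedBertAspectProbe(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_classes=2, sparsity_weight=1e-3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        for p in self.encoder.parameters():
            p.requires_grad = False  # BERT stays frozen; only mask + probe train
        hidden = self.encoder.config.hidden_size
        # Real-valued logits over hidden units; initialised so the mask starts near all-ones.
        self.mask_logits = nn.Parameter(torch.full((hidden,), 2.0))
        self.probe = nn.Linear(hidden, num_classes)
        self.sparsity_weight = sparsity_weight

    def binary_mask(self):
        probs = torch.sigmoid(self.mask_logits)
        hard = (probs > 0.5).float()
        # Straight-through estimator: forward uses the hard 0/1 mask,
        # backward propagates gradients through the underlying probabilities.
        return hard + probs - probs.detach()

    def forward(self, input_ids, attention_mask, labels=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]        # [CLS] representation
        logits = self.probe(cls * self.binary_mask())  # keep only masked units
        loss = None
        if labels is not None:
            loss = nn.functional.cross_entropy(logits, labels)
            # Push the mask toward a sparse subnetwork.
            loss = loss + self.sparsity_weight * torch.sigmoid(self.mask_logits).mean()
        return logits, loss


# Usage sketch (assumed inputs):
#   tok = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
#   batch = tok(["a great movie", "a dull movie"], return_tensors="pt", padding=True)
#   model = MaskedBertAspectProbe()
#   logits, loss = model(batch["input_ids"], batch["attention_mask"],
#                        labels=torch.tensor([1, 0]))
```

Under these assumptions, disentanglement would come from training separate masks of this kind for different aspects (e.g., one for sentiment, one for genre) over the same frozen encoder; the paper's actual training objectives and pruning schedule may differ.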