Previous work on concept identification in neural representations has focused on linear concept subspaces and their neutralization. In this work, we formulate the notion of linear guardedness -- the inability to directly predict a given concept from the representation -- and study its implications. We show that, in the binary case, the neutralized concept cannot be recovered by an additional linear layer. However, we point out that -- contrary to what was implicitly argued in previous works -- multiclass softmax classifiers can be constructed that indirectly recover the concept. Thus, linear guardedness does not guarantee that linear classifiers do not utilize the neutralized concepts, shedding light on theoretical limitations of linear information removal methods.
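The guardedness claim in the binary case can be illustrated with a minimal toy sketch (all names and the 2-D setup are hypothetical, not from the paper): we fit a linear probe for a binary concept, project the representation onto the probe's nullspace, and check that a fresh linear probe then falls to roughly chance accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
# Toy representation: dim 0 carries unrelated signal,
# dim 1 carries the binary concept we want to guard.
X = rng.normal(size=(n, 2))
concept = (X[:, 1] > 0).astype(int)

# A linear probe easily predicts the concept before removal.
probe = LogisticRegression().fit(X, concept)
acc_before = probe.score(X, concept)

# Neutralize the concept subspace: project onto the
# nullspace of the probe's weight direction.
w = probe.coef_ / np.linalg.norm(probe.coef_)
P = np.eye(2) - w.T @ w
X_guarded = X @ P

# A newly trained linear probe now performs near chance,
# i.e. the representation is linearly guarded w.r.t. the concept.
probe2 = LogisticRegression().fit(X_guarded, concept)
acc_after = probe2.score(X_guarded, concept)
```

This only demonstrates the binary guardedness direction; the paper's multiclass softmax construction that indirectly recovers the concept is a separate argument not reproduced here.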