The representation space of neural models for textual data emerges in an unsupervised manner during training. Understanding how human-interpretable concepts, such as gender, are encoded in these representations would improve the ability of users to \emph{control} the content of these representations and to analyze the workings of the models that rely on them. One prominent approach to the control problem is the identification and removal of linear concept subspaces -- subspaces in the representation space that correspond to a given concept. While such subspaces are tractable and interpretable, neural networks do not necessarily represent concepts in linear subspaces. We propose a kernelization of the linear concept-removal objective of [Ravfogel et al. 2022] and show that it is effective in guarding against the ability of certain nonlinear adversaries to recover the concept. Interestingly, our findings suggest that the division between linear and nonlinear models is overly simplistic: when considering the concept of binary gender and its neutralization, we do not find a single kernel space that exclusively contains all the concept-related information. It is therefore challenging to protect against \emph{all} nonlinear adversaries at once.