The representation space of neural models for textual data emerges in an unsupervised manner during training. Understanding how those representations encode human-interpretable concepts is a fundamental problem. One prominent approach to identifying concepts in neural representations is searching for a linear subspace whose erasure prevents the prediction of the concept from the representations. However, while many linear erasure algorithms are tractable and interpretable, neural networks do not necessarily represent concepts in a linear manner. To identify non-linearly encoded concepts, we propose a kernelization of a linear minimax game for concept erasure. We demonstrate that it is possible to prevent specific non-linear adversaries from predicting the concept. However, the protection does not transfer to different non-linear adversaries. Therefore, exhaustively erasing a non-linearly encoded concept remains an open problem.
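To make the idea concrete, the following is a minimal sketch of kernelized concept erasure, not the paper's exact algorithm: the kernel is approximated with random Fourier features (RBF), and the linear eraser is an INLP-style iterative nullspace projection rather than the minimax game mentioned above. All function names, hyperparameters, and the toy data are illustrative assumptions.

```python
# Minimal sketch: map representations into an approximate kernel feature space,
# then apply a *linear* concept eraser there. This stands in for the kernelized
# minimax game described in the abstract; it is not the paper's method.
import numpy as np
from sklearn.linear_model import LogisticRegression


def random_fourier_features(X, n_features=512, gamma=0.1, seed=0):
    """Map X (n, d) into an approximate RBF-kernel feature space (n, n_features)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)


def erase_concept_linearly(Phi, z, n_iters=10):
    """Iteratively project out directions a linear probe uses to predict concept z."""
    dim = Phi.shape[1]
    P = np.eye(dim)  # accumulated nullspace projection
    for _ in range(n_iters):
        clf = LogisticRegression(max_iter=1000).fit(Phi @ P, z)
        w = clf.coef_ / np.linalg.norm(clf.coef_)
        P = P @ (np.eye(dim) - w.T @ w)  # remove the probe's direction
    return Phi @ P, P


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 32))            # toy "representations"
    z = (X[:, 0] * X[:, 1] > 0).astype(int)    # non-linearly encoded binary concept

    Phi = random_fourier_features(X)
    print("probe accuracy before erasure:",
          LogisticRegression(max_iter=1000).fit(Phi, z).score(Phi, z))

    Phi_clean, _ = erase_concept_linearly(Phi, z)
    print("probe accuracy after erasure:",
          LogisticRegression(max_iter=1000).fit(Phi_clean, z).score(Phi_clean, z))
```

Note that, as the abstract states, guarding against one family of non-linear adversaries (here, an RBF-like feature map) does not guarantee that a differently parameterized adversary cannot still recover the concept.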