Neural network models trained on text data have been found to encode undesirable linguistic or sensitive concepts in their representations. Removing such concepts is non-trivial because of the complex relationship between the concept, the text input, and the learnt representation. Recent work has proposed post-hoc and adversarial methods to remove such unwanted concepts from a model's representation. Through an extensive theoretical and empirical analysis, we show that these methods can be counter-productive: they are unable to remove the concepts entirely, and in the worst case may end up destroying all task-relevant features. The reason is the methods' reliance on a probing classifier as a proxy for the concept. Even under the most favorable conditions for learning a probing classifier, when the concept's relevant features in representation space alone can provide 100% accuracy, we prove that a probing classifier is likely to use non-concept features, and thus post-hoc or adversarial methods will fail to remove the concept correctly. These theoretical implications are confirmed by experiments on models trained on synthetic, Multi-NLI, and Twitter datasets. For sensitive applications of concept removal such as fairness, we recommend caution against using these methods and propose a spuriousness metric to gauge the quality of the final classifier.
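To make the failure mode concrete, below is a minimal synthetic sketch (not the paper's actual experiments or theory): a concept label is 100% predictable from its own feature, yet a regularized linear probe trained to detect it also picks up a correlated task feature. Projecting the representation onto the probe's null space, an INLP-style post-hoc removal step used here as one illustrative example of such methods, can then damage task information while leaving the concept partially recoverable. All variable names, the 90% concept/task correlation, and the regularization strength are illustrative assumptions.

```python
# Illustrative sketch of probe-based concept removal on synthetic data.
# Assumptions (not from the paper): 2-D representation, 90% concept/task
# correlation, regularized logistic-regression probe, one projection step.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Concept label c and task label y agree on 90% of examples.
c = rng.integers(0, 2, n)
y = np.where(rng.random(n) < 0.9, c, 1 - c)

# Two-dimensional "representation": one concept feature, one task feature.
X = np.column_stack([
    2.0 * c + 0.1 * rng.standard_normal(n),  # concept-relevant dimension
    2.0 * y + 0.1 * rng.standard_normal(n),  # task-relevant dimension
])

# A regularized probe trained to predict the concept; it tends to place
# weight on the correlated task dimension as well, even though the
# concept dimension alone would suffice.
probe = LogisticRegression(C=0.01, max_iter=1000).fit(X, c)
print("probe weights [concept dim, task dim]:", probe.coef_[0].round(2))

# Post-hoc removal: project the representation onto the probe's null space.
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
X_removed = X - np.outer(X @ w, w)

# After "removal", the concept is often still predictable (via the
# correlated task feature), while task accuracy can drop because the
# removed direction carried task information.
def acc(Z, t):
    return LogisticRegression(max_iter=1000).fit(Z, t).score(Z, t)

print(f"task accuracy    before/after removal: {acc(X, y):.2f} / {acc(X_removed, y):.2f}")
print(f"concept accuracy before/after removal: {acc(X, c):.2f} / {acc(X_removed, c):.2f}")
```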