Neural network models trained on text data have been found to encode undesired linguistic or sensitive attributes in their representations. Removing such attributes is non-trivial because of the complex relationship between the attribute, the text input, and the learnt representation. Recent work has proposed post-hoc and adversarial methods to remove such unwanted attributes from a model's representation. Through an extensive theoretical and empirical analysis, we show that these methods can be counter-productive: they are unable to remove the attributes entirely, and in the worst case may end up destroying all task-relevant features. The reason is the methods' reliance on a probing classifier as a proxy for the attribute. Even under the most favorable conditions, when an attribute's features in representation space alone can provide 100% accuracy for learning the probing classifier, we prove that post-hoc or adversarial methods will fail to remove the attribute correctly. These theoretical implications are confirmed by empirical experiments on models trained on synthetic, Multi-NLI, and Twitter datasets. For sensitive applications of attribute removal such as fairness, we recommend caution against using these methods and propose a spuriousness metric to gauge the quality of the final classifier.
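To make the setup concrete, the following is a minimal, hypothetical sketch (not the paper's exact method or datasets) of the kind of post-hoc removal the abstract refers to: a linear probe is fit to predict a protected attribute from representations, and the probe direction is then projected out. All names (X, z, w) and the synthetic data are assumptions for illustration only.

```python
import numpy as np

# Illustrative sketch of probe-based post-hoc attribute removal.
# Synthetic data; not the experimental setup from the paper.
rng = np.random.default_rng(0)
n, d = 1000, 16

z = rng.integers(0, 2, size=n)          # binary protected attribute
X = rng.normal(size=(n, d))             # representations
X[:, 0] += 2.0 * z                      # one direction encodes the attribute

# Fit a linear probe for z (least squares as a stand-in for a probing classifier).
w, *_ = np.linalg.lstsq(X, z - z.mean(), rcond=None)
w = w / np.linalg.norm(w)

# Post-hoc removal: project representations onto the null space of the probe.
P = np.eye(d) - np.outer(w, w)
X_cleaned = X @ P

# The probe direction is removed, but attribute information may survive in
# correlated or non-linear features -- the failure mode the paper analyses.
print("variance along probe direction before:", np.var(X @ w))
print("variance along probe direction after: ", np.var(X_cleaned @ w))
```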