Neural network models trained on text data have been found to encode undesirable linguistic or sensitive concepts in their representations. Removing such concepts is non-trivial because of the complex relationship between the concept, the text input, and the learnt representation. Recent work has proposed post-hoc and adversarial methods to remove such unwanted concepts from a model's representation. Through an extensive theoretical and empirical analysis, we show that these methods can be counter-productive: they are unable to remove the concepts entirely, and in the worst case may end up destroying all task-relevant features. The reason is the methods' reliance on a probing classifier as a proxy for the concept. Even under the most favorable conditions for learning a probing classifier, when the concept's relevant features in representation space alone can provide 100% accuracy, we prove that a probing classifier is likely to use non-concept features, and thus post-hoc or adversarial methods will fail to remove the concept correctly. These theoretical implications are confirmed by experiments on models trained on synthetic, Multi-NLI, and Twitter datasets. For sensitive applications of concept removal such as fairness, we recommend caution against using these methods and propose a spuriousness metric to gauge the quality of the final classifier.
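To make the failure mode concrete, below is a minimal synthetic sketch (not the paper's actual experiments or theory): a concept label is 100% predictable from its own feature, yet a regularized linear probe trained to detect it also picks up a correlated task feature. Projecting the representation onto the probe's null space, an INLP-style post-hoc removal step used here as one illustrative example of such methods, can then damage task information while leaving the concept partially recoverable. All variable names, the 90% concept/task correlation, and the regularization strength are illustrative assumptions.

```python
# Illustrative sketch of probe-based concept removal on synthetic data.
# Assumptions (not from the paper): 2-D representation, 90% concept/task
# correlation, regularized logistic-regression probe, one projection step.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Concept label c and task label y agree on 90% of examples.
c = rng.integers(0, 2, n)
y = np.where(rng.random(n) < 0.9, c, 1 - c)

# Two-dimensional "representation": one concept feature, one task feature.
X = np.column_stack([
    2.0 * c + 0.1 * rng.standard_normal(n),  # concept-relevant dimension
    2.0 * y + 0.1 * rng.standard_normal(n),  # task-relevant dimension
])

# A regularized probe trained to predict the concept; it tends to place
# weight on the correlated task dimension as well, even though the
# concept dimension alone would suffice.
probe = LogisticRegression(C=0.01, max_iter=1000).fit(X, c)
print("probe weights [concept dim, task dim]:", probe.coef_[0].round(2))

# Post-hoc removal: project the representation onto the probe's null space.
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
X_removed = X - np.outer(X @ w, w)

# After "removal", the concept is often still predictable (via the
# correlated task feature), while task accuracy can drop because the
# removed direction carried task information.
def acc(Z, t):
    return LogisticRegression(max_iter=1000).fit(Z, t).score(Z, t)

print(f"task accuracy    before/after removal: {acc(X, y):.2f} / {acc(X_removed, y):.2f}")
print(f"concept accuracy before/after removal: {acc(X, c):.2f} / {acc(X_removed, c):.2f}")
```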