Pre-trained language models have been successful on text classification tasks, but they are prone to learning spurious correlations from biased datasets and are thus vulnerable when making inferences in a new domain. Prior work reveals such spurious patterns via post-hoc explanation algorithms that compute the importance of input features. The model is then regularized to align the importance scores with human knowledge, so that unintended model behaviors are eliminated. However, such a regularization technique lacks flexibility and coverage, since only the importance scores of a pre-defined list of features are adjusted, while more complex human knowledge, such as feature interaction and pattern generalization, can hardly be incorporated. In this work, we propose to refine a learned language model for a target domain by collecting human-provided compositional explanations regarding observed biases. By parsing these explanations into executable logic rules, the human-specified refinement advice from a small set of explanations can be generalized to more training examples. We additionally introduce a regularization term that allows adjustments to both the importance and the interaction of features to better rectify model behavior. We demonstrate the effectiveness of the proposed approach on two text classification tasks by showing improved performance in the target domain as well as improved model fairness after refinement.
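To make the proposed regularization term concrete, the sketch below illustrates one way a penalty over both feature importance and feature interaction could be combined with the task loss. This is a minimal illustration, not the paper's exact formulation: it assumes a HuggingFace-style sequence classifier that accepts `inputs_embeds`, uses gradient-times-input saliency as the importance measure, and uses a second-order (Hessian-based) term as the interaction measure; the names `advice_mask` (token positions flagged by a parsed explanation) and `interaction_pairs` (flagged position pairs) are hypothetical.

```python
import torch
import torch.nn.functional as F

def explanation_regularized_loss(model, inputs_embeds, labels, advice_mask,
                                 interaction_pairs=None, lam=1.0):
    """Task loss plus penalties that push the importance of human-flagged
    (spurious) features, and optionally their pairwise interactions, toward zero.

    inputs_embeds:     (batch, seq_len, hidden) input embeddings
    advice_mask:       (batch, seq_len) float mask, 1.0 at flagged token positions
    interaction_pairs: optional list of (i, j) position pairs flagged as spurious
    """
    inputs_embeds = inputs_embeds.clone().requires_grad_(True)
    logits = model(inputs_embeds=inputs_embeds).logits
    task_loss = F.cross_entropy(logits, labels)

    # First-order importance: gradient-times-input saliency per token.
    grads = torch.autograd.grad(task_loss, inputs_embeds, create_graph=True)[0]
    importance = (grads * inputs_embeds).sum(-1)          # (batch, seq_len)
    imp_penalty = (importance * advice_mask).pow(2).sum()

    # Second-order interaction: how strongly position i's importance depends
    # on position j's embedding; penalize it for flagged pairs.
    inter_penalty = inputs_embeds.new_zeros(())
    if interaction_pairs:
        for i, j in interaction_pairs:
            hess_row = torch.autograd.grad(importance[:, i].sum(), inputs_embeds,
                                           create_graph=True)[0]
            inter_ij = (hess_row[:, j] * inputs_embeds[:, j]).sum(-1)
            inter_penalty = inter_penalty + inter_ij.pow(2).sum()

    return task_loss + lam * (imp_penalty + inter_penalty)
```

In this reading, the logic rules parsed from compositional explanations would determine which training examples receive a non-zero `advice_mask` or `interaction_pairs`, so that advice collected from a few explanations is applied across many matching examples.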