Rationalization is fundamental to human reasoning and learning. NLP models trained to produce rationales along with predictions, called self-rationalization models, have been investigated for their interpretability and utility to end-users. However, the extent to which training with human-written rationales facilitates learning remains an under-explored question. We ask whether training models to self-rationalize can aid in their learning to solve tasks for the right reasons. Specifically, we evaluate how training self-rationalization models with free-text rationales affects robustness to spurious correlations in fine-tuned encoder-decoder and decoder-only models of six different sizes. We evaluate robustness to spurious correlations by measuring performance on 1) manually annotated challenge datasets and 2) subsets of original test sets where reliance on spurious correlations would fail to produce correct answers. We find that while self-rationalization can improve robustness to spurious correlations in low-resource settings, it tends to hurt robustness in higher-resource settings. Furthermore, these effects depend on model family and size, as well as on rationale content. Together, our results suggest that explainability can come at the cost of robustness; thus, appropriate care should be taken when training self-rationalizing models with the goal of creating more trustworthy models.
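To make the setup concrete, the following is a minimal, illustrative sketch (not the paper's actual code) of the two ingredients described above: formatting a training example for self-rationalization in a WT5-style input/output scheme, where the target contains the label followed by a free-text rationale, and selecting a "counter-heuristic" test subset in which a common NLI spurious correlation (high premise-hypothesis lexical overlap implying entailment) would yield the wrong answer. The field names, the overlap heuristic, and the threshold are assumptions chosen for illustration.

```python
# Illustrative sketch only; field names, the overlap heuristic, and the
# threshold below are assumptions, not the paper's implementation.

def format_self_rationalization_example(premise: str, hypothesis: str,
                                         label: str, rationale: str):
    """Build an (input, target) pair where the model must emit the label
    followed by a free-text rationale (WT5-style formatting)."""
    source = f"explain nli premise: {premise} hypothesis: {hypothesis}"
    target = f"{label} explanation: {rationale}"
    return source, target


def lexical_overlap(premise: str, hypothesis: str) -> float:
    """Fraction of hypothesis tokens that also appear in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    return sum(t in premise_tokens for t in hypothesis_tokens) / len(hypothesis_tokens)


def counter_heuristic_subset(examples, overlap_threshold: float = 0.9):
    """Keep test examples where relying on the overlap heuristic
    (high overlap -> 'entailment') would produce the wrong label."""
    return [
        ex for ex in examples
        if lexical_overlap(ex["premise"], ex["hypothesis"]) >= overlap_threshold
        and ex["label"] != "entailment"
    ]


if __name__ == "__main__":
    example = {
        "premise": "The doctor near the actor danced.",
        "hypothesis": "The actor danced.",
        "label": "non-entailment",
        "rationale": "It was the doctor, not the actor, who danced.",
    }
    print(format_self_rationalization_example(
        example["premise"], example["hypothesis"],
        example["label"], example["rationale"]))
    print(counter_heuristic_subset([example]))
```

A model scoring well on such a subset cannot be relying on the overlap shortcut alone, which is the intuition behind evaluating on challenge datasets and on filtered portions of the original test sets.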