Fact verification datasets are typically constructed through crowdsourcing because text sources with veracity labels are scarce. However, the crowdsourcing process often introduces unwanted biases into the data, which cause models to learn spurious patterns. In this paper, we propose CrossAug, a contrastive data augmentation method for debiasing fact verification models. Specifically, we employ a two-stage augmentation pipeline to generate new claims and evidence from existing samples. The generated samples are then paired cross-wise with the original pair, forming contrastive samples that encourage the model to rely less on spurious patterns and to learn more robust representations. Experimental results show that our method outperforms the previous state-of-the-art debiasing technique by 3.6% on the debiased extension of the FEVER dataset, with a total performance boost of 10.13% over the baseline. Furthermore, we evaluate our approach in data-scarce settings, where models can be more susceptible to biases due to the lack of training data. Experimental results demonstrate that our approach is also effective at debiasing in these low-resource conditions, exceeding the baseline performance on the Symmetric dataset with just 1% of the original data.
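The cross-wise pairing step described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out how labels are assigned, so the code assumes that the generated claim contradicts the original claim and that the generated evidence is edited to agree with the generated claim; under that assumption, the two mixed pairs receive the flipped label and the fully generated pair keeps the original label. The names `Sample` and `cross_pair` are hypothetical, and the generated claim and evidence are passed in as plain strings in place of the two-stage generation pipeline, which is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sample:
    claim: str
    evidence: str
    label: str  # "SUPPORTS" or "REFUTES"


def cross_pair(original: Sample, new_claim: str, new_evidence: str) -> List[Sample]:
    """Combine an original (claim, evidence) pair with a generated pair.

    Assumption (not stated in the abstract): new_claim contradicts the
    original claim, and new_evidence is consistent with new_claim.
    """
    flipped = "REFUTES" if original.label == "SUPPORTS" else "SUPPORTS"
    return [
        original,                                        # original claim, original evidence
        Sample(new_claim, original.evidence, flipped),   # generated claim, original evidence
        Sample(original.claim, new_evidence, flipped),   # original claim, generated evidence
        Sample(new_claim, new_evidence, original.label), # generated claim, generated evidence
    ]


if __name__ == "__main__":
    # Hand-written placeholder strings stand in for generated text.
    seed = Sample(
        claim="The bridge opened in 1990.",
        evidence="The bridge opened to traffic in 1990.",
        label="SUPPORTS",
    )
    for s in cross_pair(
        seed,
        new_claim="The bridge opened in 1985.",
        new_evidence="The bridge opened to traffic in 1985.",
    ):
        print(s.label, "|", s.claim, "|", s.evidence)
```

Because each claim and each evidence text appears once with each label in this sketch, a model cannot separate the four samples from claim wording alone, which is the intuition behind pairing the samples contrastively.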