Despite the remarkable success deep models have achieved in Textual Matching (TM) tasks, it remains unclear whether they truly understand language or merely measure the semantic similarity of texts by exploiting statistical biases in datasets. In this work, we provide a new perspective on this issue via the length divergence bias. We find that the length divergence heuristic is widespread in prevalent TM datasets, providing direct cues for prediction. To determine whether TM models adopt such a heuristic, we introduce an adversarial evaluation scheme that invalidates it. In this adversarial setting, all TM models perform worse, indicating that they have indeed adopted the heuristic. Through a carefully designed probing experiment, we empirically validate that the bias of TM models can be attributed in part to their extraction of text length information during training. To alleviate the length divergence bias, we propose an adversarial training method. The results demonstrate that our method improves the robustness and the generalization ability of models at the same time.