In this paper, we focus on the robustness evaluation of Chinese question matching. Most previous work on robustness analysis considers only one or a few types of artificial adversarial examples. Instead, we argue that a comprehensive evaluation of models' linguistic capabilities on natural texts is necessary. For this purpose, we create a Chinese dataset named DuQM, which contains natural questions with linguistic perturbations, to evaluate the robustness of question matching models. DuQM contains 3 categories and 13 subcategories with 32 linguistic perturbations. Extensive experiments demonstrate that DuQM distinguishes different models more effectively. Importantly, the detailed breakdown of evaluation by linguistic phenomenon in DuQM helps us easily diagnose the strengths and weaknesses of different models. Additionally, our experimental results show that the effect of artificial adversarial examples does not carry over to natural texts.