We introduce a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure. We show that training models on this new dataset leads to state-of-the-art performance on a variety of popular NLI benchmarks, while posing a more difficult challenge with its new test set. Our analysis sheds light on the shortcomings of current state-of-the-art models, and shows that non-expert annotators are successful at finding their weaknesses. The data collection method can be applied in a never-ending learning scenario, becoming a moving target for NLU, rather than a static benchmark that will quickly saturate.