To create models that are robust across a wide range of test inputs, training datasets should include diverse examples that span numerous phenomena. Dynamic adversarial data collection (DADC), where annotators craft examples that challenge continually improving models, holds promise as an approach for generating such diverse training sets. Prior work has shown that running DADC over 1-3 rounds can help models fix some error types, but it does not necessarily lead to better generalization beyond adversarial test data. We argue that running DADC over many rounds maximizes its training-time benefits, as the different rounds can together cover many of the task-relevant phenomena. We present the first study of longer-term DADC, where we collect 20 rounds of NLI examples for a small set of premise paragraphs, with both adversarial and non-adversarial approaches. Models trained on DADC examples make 26% fewer errors on our expert-curated test set than models trained on non-adversarial data. Our analysis shows that DADC yields examples that are more difficult and more lexically and syntactically diverse, and that contain fewer annotation artifacts, than non-adversarial examples.
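To make the round-based procedure concrete, the following is a minimal Python sketch of the multi-round DADC loop the abstract describes, not the paper's actual pipeline: the `train` and `collect_adversarial` callables are hypothetical placeholders for the NLI model trainer and the human annotation step in which writers keep examples that fool the current model-in-the-loop, and the model is simply retrained on all rounds accumulated so far.

```python
from typing import Callable, List, Tuple

# (premise, hypothesis, label) triples, as in NLI datasets.
Example = Tuple[str, str, str]

def run_dadc(
    premises: List[str],
    train: Callable[[List[Example]], object],                      # hypothetical trainer
    collect_adversarial: Callable[[List[str], object], List[Example]],  # hypothetical annotation step
    num_rounds: int = 20,
) -> Tuple[object, List[Example]]:
    """Sketch of long-term DADC: alternate human collection and retraining."""
    dataset: List[Example] = []
    model = train(dataset)  # initial model (e.g., pretrained, or fit on seed data)
    for _ in range(num_rounds):
        # Annotators write hypotheses for the fixed premise paragraphs,
        # keeping only examples the current model misclassifies.
        dataset.extend(collect_adversarial(premises, model))
        model = train(dataset)  # retrain on all rounds collected so far
    return model, dataset
```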