Standard NLP tasks omit several common real-world scenarios, such as seeking clarification about a question, taking advantage of clues, and abstaining to avoid incorrect answers. This difference in task formulation hinders the adoption of NLP systems in real-world settings. In this work, we take a step towards bridging this gap and present a multi-stage task that simulates a typical human-human questioner-responder interaction, such as an interview. Specifically, when the system is not sufficiently confident in its prediction, it is provided with question simplifications, knowledge statements, examples, etc. at successive stages to help it improve that prediction. We instantiate the proposed task in a Natural Language Inference setting, where a system is evaluated on both in-domain and out-of-domain (OOD) inputs. We conduct comprehensive experiments and find that the multi-stage formulation of our task improves OOD generalization performance by up to 2.29% in Stage 1, 1.91% in Stage 2, 54.88% in Stage 3, and 72.02% in Stage 4 over standard unguided prediction. However, our task leaves a significant challenge for NLP researchers: further improving OOD performance at each stage.
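The staged interaction described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `Predictor` interface, the `staged_predict` helper, the single confidence `threshold`, the order in which hints are revealed, and the toy model are all hypothetical. The sketch only shows the general control flow: predict unguided first, reveal one more piece of guidance per stage while confidence stays low, and abstain if no stage is confident enough.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interface (not from the paper): a model maps
# (premise, hypothesis, hints seen so far) to (label, confidence in [0, 1]).
Predictor = Callable[[str, str, Sequence[str]], Tuple[str, float]]


def staged_predict(
    model: Predictor,
    premise: str,
    hypothesis: str,
    stage_hints: Sequence[str],
    threshold: float = 0.8,
) -> Tuple[Optional[str], int]:
    """Return (label, stage) for the first stage whose confidence reaches
    the threshold, or (None, last_stage) to signal abstention.

    Stage 0 is the unguided prediction; stage k has seen the first k hints.
    """
    hints: List[str] = []
    label, confidence = model(premise, hypothesis, hints)
    if confidence >= threshold:
        return label, 0

    for stage, hint in enumerate(stage_hints, start=1):
        # Reveal one more piece of guidance (assumed order: e.g. a question
        # simplification, then a knowledge statement, then an example).
        hints.append(hint)
        label, confidence = model(premise, hypothesis, hints)
        if confidence >= threshold:
            return label, stage

    # Never confident enough: abstain rather than risk a wrong answer.
    return None, len(stage_hints)


def toy_model(premise: str, hypothesis: str, hints: Sequence[str]) -> Tuple[str, float]:
    """Toy stand-in for a real NLI model: confidence grows with each hint."""
    return "entailment", min(1.0, 0.5 + 0.2 * len(hints))


if __name__ == "__main__":
    result = staged_predict(
        toy_model,
        "A man is playing a guitar on stage.",
        "A person is performing music.",
        ["simplified hypothesis", "knowledge statement", "similar labeled example"],
    )
    print(result)
```

With the toy model, confidence is 0.5 unguided, 0.7 after one hint, and 0.9 after two, so the call returns at stage 2. A real system would replace `toy_model` with an NLI classifier and a calibrated confidence estimate; a single fixed threshold is the simplest possible choice here and is not claimed to match the paper's setup.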