We introduce STREET, a unified multi-task and multi-domain natural language reasoning and explanation benchmark. Unlike most existing question-answering (QA) datasets, we expect models not only to answer questions but also to produce step-by-step structured explanations describing how premises in the question are used to derive intermediate conclusions that support the correctness of a given answer. We perform extensive evaluation with popular language models such as few-shot prompted GPT-3 and fine-tuned T5. We find that these models still lag behind human performance when producing such structured reasoning steps. We believe this work will provide a way for the community to better train and test systems on multi-step reasoning and explanation in natural language.