In current NLP research, large-scale language models and their abilities are widely discussed. Several recent works have also reported notable failures of these models, and these failure cases often involve complex reasoning abilities. This work focuses on a simple commonsense ability: reasoning about when an action (or its effect) is feasible. To this end, we introduce FeasibilityQA, a question-answering dataset consisting of binary classification questions (BCQ) and multiple-choice multiple-correct questions (MCQ) that test understanding of feasibility. We show that even state-of-the-art models such as GPT-3, GPT-2, and T5 struggle to answer feasibility questions correctly. Specifically, on MCQ and BCQ questions, GPT-3 achieves accuracies of just (19%, 62%) in the zero-shot setting and (25%, 64%) in the few-shot setting, respectively. We also evaluate the models after providing the relevant knowledge statements required to answer each question. We find that this additional knowledge leads to a 7% gain in performance, but overall performance remains low. These results raise the question of how much commonsense knowledge about action feasibility is encoded in state-of-the-art models and how well they can reason about it.