Harnessing logical reasoning ability is a comprehensive natural language understanding endeavor. With the release of Generative Pretrained Transformer 4 (GPT-4), which is highlighted as "advanced" at reasoning tasks, we are eager to learn how GPT-4 performs on various logical reasoning tasks. This report analyses multiple logical reasoning datasets, including popular benchmarks such as LogiQA and ReClor and newly released datasets such as AR-LSAT. We test multi-choice reading comprehension and natural language inference tasks using benchmarks that require logical reasoning. We further construct an out-of-distribution logical reasoning dataset to investigate the robustness of ChatGPT and GPT-4. We also compare the performance of ChatGPT and GPT-4. Experimental results show that ChatGPT performs significantly better than the RoBERTa fine-tuning method on most logical reasoning benchmarks. With early access to the GPT-4 API, we are able to conduct intensive experiments on the GPT-4 model. The results show that GPT-4 yields even higher performance on most logical reasoning datasets. Among the benchmarks, ChatGPT and GPT-4 do relatively well on well-known datasets like LogiQA and ReClor. However, their performance drops significantly on newly released and out-of-distribution datasets. Logical reasoning remains challenging for ChatGPT and GPT-4, especially on out-of-distribution and natural language inference datasets. We release the prompt-style logical reasoning datasets as a benchmark suite and name it LogiEval.
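To make the prompt-style, multi-choice evaluation described above concrete, the following is a minimal sketch of how one item might be turned into a prompt and scored for accuracy. It is an illustration under stated assumptions, not the procedure used in the report: the prompt wording, the item fields (`passage`, `question`, `options`, `answer`), the four-option limit, and the helper names `build_prompt`, `parse_choice`, and `accuracy` are all hypothetical, and the model is represented by an arbitrary callable rather than any real API.

```python
import re
from typing import Callable, Dict, List, Optional

LETTERS = "ABCD"  # assumption: at most four options per question


def build_prompt(passage: str, question: str, options: List[str]) -> str:
    """Format one multi-choice item as a plain-text prompt (hypothetical layout)."""
    lines = [f"Passage: {passage}", f"Question: {question}"]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def parse_choice(reply: str) -> Optional[str]:
    """Return the first standalone option letter in the reply, or None.

    Limitation: a reply that begins with the English article "A" would be
    read as option A; a real harness would need stricter parsing.
    """
    match = re.search(r"\b([A-D])\b", reply)
    return match.group(1) if match else None


def accuracy(items: List[Dict], model: Callable[[str], str]) -> float:
    """Fraction of items where the parsed reply matches the gold answer letter."""
    if not items:
        return 0.0
    correct = 0
    for item in items:
        prompt = build_prompt(item["passage"], item["question"], item["options"])
        if parse_choice(model(prompt)) == item["answer"]:
            correct += 1
    return correct / len(items)
```

Here `model` can be any function that maps a prompt string to a reply string, so the same scoring loop could in principle be pointed at different systems for comparison.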