GPT-3 models are very powerful, achieving high performance on a variety of natural language processing tasks. However, relatively little detailed analysis has been published on how well they perform on the task of grammatical error correction (GEC). To address this, we perform experiments testing the capabilities of a GPT-3 model (text-davinci-003) against major GEC benchmarks, comparing the performance of several different prompts in both zero-shot and few-shot settings. We analyze intriguing or problematic outputs encountered with the different prompt formats, one of which is sketched below for illustration. We report the performance of our best prompt on the BEA-2019 and JFLEG datasets using a combination of automatic metrics and human evaluations, revealing interesting differences between the preferences of human raters and those of the reference-based automatic metrics.
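For illustration, the following is a minimal sketch of how a zero-shot GEC prompt might be sent to text-davinci-003, assuming the legacy openai-python SDK (pre-1.0) with an API key set in the environment. The prompt wording and the correct_sentence helper are illustrative assumptions, not the exact prompts evaluated in this work.

```python
import openai  # legacy openai-python SDK (<1.0); reads OPENAI_API_KEY from the environment

def correct_sentence(sentence: str) -> str:
    """Zero-shot GEC: ask text-davinci-003 to rewrite the input sentence
    grammatically and return the completion text."""
    prompt = (
        "Correct the grammatical errors in the following sentence. "
        "If the sentence is already correct, return it unchanged.\n\n"
        f"Sentence: {sentence}\nCorrection:"
    )
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=0,   # deterministic decoding for evaluation
        max_tokens=256,
    )
    return response["choices"][0]["text"].strip()

print(correct_sentence("She go to school every days."))
```

A few-shot variant would simply prepend a small number of (erroneous sentence, corrected sentence) example pairs to the same prompt before the target sentence.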