Recent work has demonstrated substantial gains from pre-training large-scale unidirectional language models such as GPT-2, GPT-3, and GPT-neo, followed by fine-tuning on a downstream task. In this paper, we evaluate the performance of the 1.3-billion-parameter GPT-neo model on commonsense reasoning tasks. We assess the model on six commonsense reasoning benchmarks and report accuracy scores for each. When fine-tuned with an appropriate set of hyperparameters, the model obtains competitive scores on three of these tasks but struggles when the dataset is significantly smaller. The low performance on a few of these tasks suggests that these datasets are inherently difficult and that the model fails to establish coherent patterns from their limited training samples. We also investigate and substantiate our results using visualization and conduct numerous inference tests to better understand the model's performance. Finally, we conduct thorough robustness tests using various methods to gauge the model's performance under numerous settings. These findings suggest a promising path for exploring language models smaller than the 175-billion-parameter GPT-3 model for tasks requiring natural language understanding.