Finetuning large pre-trained language models with a task-specific head has advanced the state of the art on many natural language understanding benchmarks. However, models with a task-specific head require large amounts of training data, making them susceptible to learning and exploiting dataset-specific superficial cues that do not generalize to other datasets. Prompting has reduced the data requirement by reusing the language model head and formatting the task input to match the pre-training objective. One might therefore expect few-shot prompt-based models not to exploit superficial cues. This paper presents an empirical examination of whether they do. Analyzing few-shot prompt-based models on MNLI, SNLI, HANS, and COPA reveals that prompt-based models also exploit superficial cues. While these models perform well on instances with superficial cues, they often underperform or only marginally outperform random accuracy on instances without them.
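To make the prompting setup concrete, the following is a minimal sketch, not the paper's exact configuration, of cloze-style prompt-based NLI classification: the premise and hypothesis are joined by a template containing a mask token, and the masked language model head scores a small set of verbalizer words that map to labels. The RoBERTa checkpoint, the "? [MASK] ," template, and the Yes/Maybe/No verbalizer are illustrative assumptions, not details taken from the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed model and template for illustration only.
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMaskedLM.from_pretrained("roberta-large")

# Verbalizer: mask-position words mapped to NLI labels (an assumption).
verbalizer = {"Yes": "entailment", "Maybe": "neutral", "No": "contradiction"}

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."
# Cloze-style prompt reusing the pre-training (masked LM) objective.
prompt = f"{premise} ? {tokenizer.mask_token} , {hypothesis}"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and score only the verbalizer words there.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
word_ids = [
    tokenizer.encode(" " + w, add_special_tokens=False)[0] for w in verbalizer
]
scores = logits[0, mask_pos, word_ids]
print(list(verbalizer.values())[scores.argmax().item()])
```

Because only the language model head is reused, no new classifier parameters need training, which is what reduces the data requirement; the paper's finding is that this setup nonetheless picks up superficial cues.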