Prompt engineering and calibration enable large language models to excel at reasoning tasks, including multiple-choice commonsense reasoning. From a practical perspective, we investigate and evaluate these strategies on smaller language models. Through experiments on five commonsense reasoning benchmarks, we find that each strategy favors certain models, but their joint effects are mostly negative.