While large language models (LLMs) like GPT-3 have achieved impressive results on multiple choice question answering (MCQA) tasks in the zero-, one-, and few-shot settings, they generally lag behind the MCQA state of the art (SOTA). MCQA tasks have traditionally been presented to LLMs as cloze tasks: an LLM is conditioned on a question (without the associated answer options), and its chosen option is the one assigned the highest probability after normalization (for length, etc.). A more natural prompting approach is to present the question and answer options to the LLM jointly and have it output the symbol (e.g., "A") associated with its chosen answer option. This approach allows the model to explicitly compare answer options, reduces computational costs, and mitigates the effects of tokenization scheme and answer option representations on answer selection. For the natural approach to be effective, the LLM it is used with must be able to associate answer options with the symbols that represent them. The LLM needs what we term multiple choice symbol binding (MCSB) ability. This ability varies greatly by model. We show that a model with high MCSB ability performs much better with the natural approach than with the traditional approach across 20 diverse datasets and largely closes the gap with the SOTA, suggesting that the MCQA ability of LLMs has been previously underestimated.
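The contrast between the two prompting styles can be made concrete with a minimal sketch. The helper names and prompt formats below are illustrative assumptions, not the paper's exact implementation; a real system would score the prompts with an LLM.

```python
# Hypothetical sketch of the two MCQA prompting styles described above.
# Function names and prompt templates are illustrative assumptions.

def cloze_prompts(question, options):
    """Traditional cloze style: one prompt per answer option.
    Each prompt is scored separately by the LM, and the option with the
    highest length-normalized log-probability is selected."""
    return [f"{question} {opt}" for opt in options]

def natural_prompt(question, options, symbols="ABCD"):
    """Natural style: one prompt with symbol-labeled options.
    The LM is asked to emit the symbol of its chosen option, which
    requires multiple choice symbol binding (MCSB) ability."""
    lines = [question]
    for sym, opt in zip(symbols, options):
        lines.append(f"{sym}. {opt}")
    lines.append("Answer:")
    return "\n".join(lines)

def length_normalized_score(token_logprobs):
    """Average per-token log-probability: the normalization mentioned
    for cloze-style scoring, so longer options are not penalized."""
    return sum(token_logprobs) / len(token_logprobs)

# Example: the cloze style needs one forward pass per option, while the
# natural style needs a single pass and a one-symbol comparison.
q = "What is the capital of France?"
opts = ["London", "Paris", "Berlin"]
print(cloze_prompts(q, opts))
print(natural_prompt(q, opts))
```

Note that the cloze approach requires as many scoring passes as there are options, whereas the natural approach compares options inside one context, which is the source of the computational savings mentioned above.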