Although large language models (LLMs) often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether GPT-3.5 (Codex and InstructGPT) can be applied to answer and reason about difficult real-world-based questions. We utilize two multiple-choice medical exam question datasets (USMLE and MedMCQA) and a medical reading comprehension dataset (PubMedQA). We investigate multiple prompting scenarios: Chain-of-Thought (CoT, think step-by-step), zero- and few-shot prompting (prepending the question with question-answer exemplars), and retrieval augmentation (injecting Wikipedia passages into the prompt). For a subset of the USMLE questions, a medical expert reviewed and annotated the model's CoT. We found that InstructGPT can often read, reason, and recall expert knowledge. Failures are primarily due to a lack of knowledge and to reasoning errors, and trivial guessing heuristics are observed, e.g.\ too often predicting labels A and D on USMLE. Sampling and combining many completions overcomes some of these limitations. Using 100 samples, Codex 5-shot CoT not only yields close-to-well-calibrated predictive probabilities but also achieves human-level performance on the three datasets: USMLE 60.2%, MedMCQA 62.7%, and PubMedQA 78.2%.
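To make the few-shot CoT plus sampling setup concrete, below is a minimal sketch of few-shot chain-of-thought prompting combined with self-consistency (sample many completions, majority-vote over the parsed answers). Here `sample_completion` is a hypothetical stand-in for any LLM completion API, and the exemplars are illustrative placeholders, not the paper's actual prompts.

```python
# Sketch: few-shot CoT prompting + self-consistency over N sampled completions.
import re
from collections import Counter

# Illustrative placeholder exemplars (the paper's real prompts are not shown here).
FEW_SHOT_EXEMPLARS = """\
Question: <exemplar question 1>
Answer: Let's think step by step. <worked reasoning> Therefore, the answer is (B).

Question: <exemplar question 2>
Answer: Let's think step by step. <worked reasoning> Therefore, the answer is (D).
"""

def build_prompt(question: str, options: dict) -> str:
    # Prepend question-answer exemplars, then trigger step-by-step reasoning.
    opts = "\n".join(f"({k}) {v}" for k, v in options.items())
    return (f"{FEW_SHOT_EXEMPLARS}\n"
            f"Question: {question}\n{opts}\n"
            f"Answer: Let's think step by step.")

def extract_label(completion: str):
    # Parse the final "(X)" answer letter from the generated chain of thought.
    matches = re.findall(r"answer is \(([A-E])\)", completion)
    return matches[-1] if matches else None

def self_consistency(question, options, sample_completion, n=100, temperature=0.7):
    # sample_completion is a hypothetical callable wrapping an LLM API.
    votes = Counter()
    for _ in range(n):
        text = sample_completion(build_prompt(question, options),
                                 temperature=temperature)
        label = extract_label(text)
        if label is not None:
            votes[label] += 1
    # Vote shares over n samples can also be read as predictive probabilities.
    return votes.most_common(1)[0][0] if votes else None
```

The majority vote over 100 sampled chains of thought is what the abstract refers to as "sampling and combining many completions"; the per-label vote fractions give the (approximately calibrated) predictive probabilities mentioned above.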