Large language models (LLMs) such as ChatGPT and GPT-4 have made significant progress in NLP. However, the ability to memorize, represent, and leverage commonsense knowledge remains a well-known pain point for LLMs. It is still unclear (1) whether GPTs can effectively answer commonsense questions, (2) whether GPTs are knowledgeable in commonsense, (3) whether GPTs are aware of the commonsense knowledge underlying a specific question, and (4) whether GPTs can effectively leverage commonsense knowledge to answer questions. To investigate these questions, we conduct a series of experiments evaluating ChatGPT's commonsense abilities. The results show that: (1) GPTs achieve good QA accuracy on commonsense tasks, although they still struggle with certain types of knowledge; (2) ChatGPT is knowledgeable and can accurately generate most commonsense knowledge when given knowledge prompts; (3) despite this knowledge, ChatGPT is an inexperienced commonsense problem solver: it cannot precisely identify the commonsense knowledge needed to answer a specific question, i.e., it does not precisely know what commonsense knowledge a question requires. These findings highlight the need to investigate better mechanisms for utilizing commonsense knowledge in LLMs, such as instruction following and better commonsense guidance.