Humor is a salient testbed for human-like creative thinking in large language models (LLMs). We study humor using the Japanese creative response game Oogiri, in which participants produce witty responses to a given prompt, and ask the following research question: What makes such responses funny to humans? Previous work has offered only limited, reliable means of answering this question. Existing datasets contain few candidate responses per prompt, expose popularity signals to raters during annotation, and lack objective, comparable metrics for funniness. To address these limitations, we introduce Oogiri-Master and Oogiri-Corpus, a benchmark and a dataset designed to enable rigorous evaluation of humor understanding in LLMs. Each prompt is paired with approximately 100 diverse candidate responses, and funniness is rated independently by approximately 100 human judges without access to others' ratings, reducing popularity bias and enabling robust aggregation. Using Oogiri-Corpus, we conduct a quantitative analysis of the linguistic factors associated with funniness, such as text length, ambiguity, and incongruity resolution, and derive objective metrics for predicting human judgments. We then benchmark a range of LLMs and human baselines on Oogiri-Master, demonstrating that state-of-the-art models approach human performance and that insight-augmented prompting further improves model performance. Our results provide a principled basis for evaluating and advancing humor understanding in LLMs.