Large language models (LLMs) are increasingly deployed in real-world communication settings, yet their ability to resolve context-dependent ambiguity remains underexplored. In this work, we present EMODIS, a new benchmark for evaluating LLMs' capacity to interpret ambiguous emoji expressions under minimal but contrastive textual contexts. Each instance in EMODIS comprises an ambiguous sentence containing an emoji, two distinct disambiguating contexts that lead to divergent interpretations, and a specific question that requires contextual reasoning. We evaluate both open-source and API-based LLMs, and find that even the strongest models frequently fail to distinguish meanings when only subtle contextual cues are present. Further analysis reveals systematic biases toward dominant interpretations and limited sensitivity to pragmatic contrast. EMODIS provides a rigorous testbed for assessing contextual disambiguation, and highlights the gap in semantic reasoning between humans and LLMs.
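The instance format described above (ambiguous emoji sentence, two contrastive contexts, and a question) can be sketched as a simple data structure. This is an illustrative assumption of what one EMODIS instance might look like; the field names, example text, and class name are hypothetical, not the benchmark's actual schema.

```python
from dataclasses import dataclass


@dataclass
class EmodisInstance:
    """One hypothetical benchmark item, per the abstract's description."""
    ambiguous_sentence: str  # sentence containing the emoji
    context_a: str           # first disambiguating context
    context_b: str           # second, contrastive context
    question: str            # question requiring contextual reasoning


# Illustrative example (invented, not drawn from the dataset).
example = EmodisInstance(
    ambiguous_sentence='She replied "Sure 🙂" to the request.',
    context_a="She had been hoping someone would finally ask her to join.",
    context_b="It was the third week in a row she was asked to cover a shift.",
    question="Does the emoji convey genuine enthusiasm or reluctant politeness?",
)
```

Under this framing, a model sees the sentence paired with one of the two contexts plus the question, and is expected to produce the interpretation consistent with that context; the contrastive pair makes it possible to test whether subtle contextual cues actually change the model's answer.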