We present an interactive framework for evaluating whether large language models (LLMs) exhibit genuine "understanding" in a simple yet strategic environment. As a running example, we focus on Rock-Paper-Scissors (RPS), which, despite its apparent simplicity, requires sequential reasoning, adaptation, and strategy recognition. Our system positions the LLM as an Observer whose task is to identify which strategies are being played and to articulate the reasoning behind that judgment. The goal is not to test knowledge of RPS itself, but to probe whether the model can exhibit mind-like reasoning about sequential behavior. To support systematic evaluation, we provide a benchmark consisting of both static strategies and lightweight dynamic strategies specified by prompt-defined rules. We quantify alignment between the Observer's predictions and the ground-truth distributions induced by the actual strategy pairs using three complementary signals: cross-entropy, Brier score, and expected-value (EV) discrepancy. These metrics are further integrated into a unified score, the Union Loss, which balances calibration, sensitivity, and payoff alignment. Together with a Strategy Identification Rate (SIR), our framework captures not only predictive accuracy but also whether the model can stably identify the latent strategies in play. The demo emphasizes interactivity, transparency, and reproducibility: users can adjust LLM distributions in real time, visualize losses as they evolve, and directly inspect reasoning snippets to pinpoint where and why failures occur. In doing so, our system provides a practical and interpretable proxy for mind-like inference in sequential games, offering insight into both the strengths and limitations of current LLM reasoning.
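The three alignment signals named in the abstract can be sketched concretely. The following is a minimal illustration only: the standard RPS payoff matrix is standard game-theoretic convention, but the equal weighting in `union_loss` and all function names are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

MOVES = ["rock", "paper", "scissors"]
# Payoff to the row player: +1 win, 0 tie, -1 loss (standard RPS convention).
PAYOFF = np.array([
    [ 0, -1,  1],   # rock   vs rock / paper / scissors
    [ 1,  0, -1],   # paper
    [-1,  1,  0],   # scissors
])

def cross_entropy(p_true, q_pred, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i); lower means better calibration."""
    q = np.clip(np.asarray(q_pred, dtype=float), eps, 1.0)
    return -float(np.sum(np.asarray(p_true) * np.log(q)))

def brier_score(p_true, q_pred):
    """Mean squared error between the true and predicted distributions."""
    return float(np.mean((np.asarray(p_true) - np.asarray(q_pred)) ** 2))

def ev_discrepancy(p_self, p_opp_true, p_opp_pred):
    """Gap between expected payoff under the true vs. predicted opponent mix."""
    ev_true = float(p_self @ PAYOFF @ p_opp_true)
    ev_pred = float(p_self @ PAYOFF @ p_opp_pred)
    return abs(ev_true - ev_pred)

def union_loss(p_self, p_opp_true, p_opp_pred, w=(1.0, 1.0, 1.0)):
    """Weighted sum of the three signals; the weights here are placeholders."""
    return (w[0] * cross_entropy(p_opp_true, p_opp_pred)
            + w[1] * brier_score(p_opp_true, p_opp_pred)
            + w[2] * ev_discrepancy(p_self, p_opp_true, p_opp_pred))
```

When the predicted distribution matches the ground truth exactly, the Brier and EV terms vanish and the Union Loss reduces to the entropy of the true distribution, so the score cleanly isolates prediction error from the irreducible randomness of the strategy being observed.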