FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and to tackle the observation-behavior gap: the challenge of remembering and acting on information observed earlier in gameplay. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework that leverages long-term clue memory to better plan and solve sequential tasks. Experiments show that current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked gap remains between humans and the best-performing agents, which warrants continued research to narrow it.
