Video game testing requires game-specific knowledge as well as common sense reasoning about the events in the game. While AI-driven agents can satisfy the first requirement, it is not yet possible to meet the second requirement automatically. Therefore, video game testing often still relies on manual testing, and human testers are required to play the game thoroughly to detect bugs. As a result, it is challenging to fully automate game testing. In this study, we explore the possibility of leveraging the zero-shot capabilities of large language models for video game bug detection. By formulating the bug detection problem as a question-answering task, we show that large language models can identify which event is buggy in a sequence of textual descriptions of events from a game. To this end, we introduce the GameBugDescriptions benchmark dataset, which consists of 167 buggy gameplay videos and a total of 334 question-answer pairs across 8 games. We extensively evaluate the performance of six models across the OPT and InstructGPT large language model families on our benchmark dataset. Our results show that employing language models to detect video game bugs is a promising direction. With a suitable prompting technique, we achieve an accuracy of 70.66%, and on some video games, up to 78.94%. Our code, evaluation data, and the benchmark can be found at https://asgaardlab.github.io/LLMxBugs
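The abstract describes casting bug detection as zero-shot question answering over textual event descriptions. The sketch below illustrates one way such a prompt could look; the event descriptions, the prompt wording, and the model name are illustrative assumptions, not the authors' actual setup.

```python
# A minimal sketch (not the authors' exact prompt) of framing video game bug
# detection as zero-shot question answering over textual event descriptions.
import openai  # assumes the legacy openai Python client and OPENAI_API_KEY are available

# Hypothetical textual descriptions of consecutive gameplay events.
events = [
    "1. The player character walks toward the parked car.",
    "2. The player character opens the car door and sits inside.",
    "3. The car sinks halfway through the road surface and keeps driving.",
]

# Question-answering style prompt: ask the model which event is buggy.
prompt = (
    "The following is a sequence of events observed in a video game:\n"
    + "\n".join(events)
    + "\nQuestion: Which event describes a bug, i.e., behavior that should not "
      "happen in the game? Answer with the event number.\nAnswer:"
)

# Zero-shot completion with an InstructGPT-family model (model choice is an assumption).
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=8,
    temperature=0,
)
print(response["choices"][0]["text"].strip())  # expected output along the lines of "3"
```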