With increasing interest in procedural content generation from both academia and game developers, it is vital that different approaches can be compared fairly. However, evaluating procedurally generated video game levels is often difficult due to the lack of standardised, game-independent metrics. In this paper, we introduce two simulation-based evaluation metrics that analyse the behaviour of an A* agent to measure the diversity and difficulty of generated levels in a general, game-independent manner. Diversity is calculated by comparing action trajectories from different levels using the edit distance, and difficulty is measured by how much of the A* search tree must be explored and expanded before the agent can solve the level. We demonstrate that our diversity metric is more robust to changes in level size and representation than current methods, and that it measures factors which directly affect playability rather than focusing on visual information. The difficulty metric shows promise, correlating with existing estimates of difficulty in one of the tested domains, but it faces some challenges in the other domain. Finally, to promote reproducibility, we publicly release our evaluation framework.
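To make the two metrics concrete, the following is a minimal, self-contained sketch of the ideas described above, not the authors' released framework. Trajectories are represented as lists of discrete action labels, and the toy grid level, Manhattan heuristic, and mean-pairwise aggregation are illustrative assumptions.

```python
# Sketch of the two metrics: diversity as mean pairwise edit distance between
# action trajectories, and difficulty as the number of A* node expansions
# needed to solve a level. Inputs and aggregation choices are illustrative.

import heapq
from itertools import combinations, count
from typing import Callable, Hashable, Iterable, Sequence


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Levenshtein distance between two action trajectories."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def diversity(trajectories: list[Sequence[Hashable]]) -> float:
    """Mean pairwise edit distance over one trajectory per generated level."""
    pairs = list(combinations(trajectories, 2))
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)


def astar_difficulty(start: Hashable,
                     is_goal: Callable[[Hashable], bool],
                     neighbors: Callable[[Hashable], Iterable[Hashable]],
                     heuristic: Callable[[Hashable], float]) -> tuple[bool, int]:
    """Run A* with unit step costs and return (solved, nodes expanded).

    The expansion count serves as the difficulty proxy: harder levels force
    the agent to explore more of the search tree before finding a plan.
    """
    tie = count()  # tie-breaker so the heap never compares states directly
    frontier = [(heuristic(start), next(tie), 0, start)]
    best_g = {start: 0}
    expanded = 0
    while frontier:
        _, _, g, state = heapq.heappop(frontier)
        if g > best_g.get(state, float("inf")):
            continue  # stale queue entry
        expanded += 1
        if is_goal(state):
            return True, expanded
        for nxt in neighbors(state):
            ng = g + 1
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                heapq.heappush(frontier, (ng + heuristic(nxt), next(tie), ng, nxt))
    return False, expanded


if __name__ == "__main__":
    # Diversity: action trajectories recorded from the agent on three levels.
    trajectories = [
        ["right", "right", "jump", "right"],
        ["right", "jump", "jump", "right", "right"],
        ["right", "right", "right", "right"],
    ]
    print(f"diversity = {diversity(trajectories):.2f}")

    # Difficulty: expansions needed to cross a toy 5x5 grid with a wall.
    walls = {(2, 1), (2, 2), (2, 3)}
    goal = (4, 4)

    def grid_neighbors(p):
        x, y = p
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < 5 and 0 <= ny < 5 and (nx, ny) not in walls:
                yield (nx, ny)

    solved, expanded = astar_difficulty(
        start=(0, 0),
        is_goal=lambda p: p == goal,
        neighbors=grid_neighbors,
        heuristic=lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1]),
    )
    print(f"solved = {solved}, difficulty (expansions) = {expanded}")
```

In practice the paper's framework obtains trajectories and expansion counts from game-specific A* agents; the sketch only fixes the two quantities being computed, not how levels or agents are defined.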