Text adventure games present unique challenges to reinforcement learning methods due to their combinatorially large action spaces and sparse rewards. The interplay of these two factors is particularly demanding because large action spaces require extensive exploration, while sparse rewards provide limited feedback. This work proposes to tackle the explore-vs-exploit dilemma using a multi-stage approach that explicitly disentangles these two strategies within each episode. Our algorithm, called eXploit-Then-eXplore (XTX), begins each episode using an exploitation policy that imitates a set of promising trajectories from the past, and then switches over to an exploration policy aimed at discovering novel actions that lead to unseen state spaces. This policy decomposition allows us to combine global decisions about which parts of the game space to return to with curiosity-based local exploration in that space, motivated by how a human may approach these games. Our method significantly outperforms prior approaches by 27% and 11% average normalized score over 12 games from the Jericho benchmark (Hausknecht et al., 2020) in both deterministic and stochastic settings, respectively. On the game of Zork1, in particular, XTX obtains a score of 103, more than a 2x improvement over prior methods, and pushes past several known bottlenecks in the game that have plagued previous state-of-the-art methods.
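As a rough illustration of the episode-level policy switch described above, the following is a minimal Python sketch, not the authors' actual XTX implementation. The toy environment, the imitation and curiosity heuristics, and all names (`ToyEnv`, `ExploitPolicy`, `ExplorePolicy`, `switch_step`) are hypothetical placeholders; in particular, the fixed `switch_step` is an assumption standing in for however the real method decides when to hand off from exploitation to exploration.

```python
"""Minimal sketch: exploit a promising past trajectory, then explore.

Everything here is illustrative; it only mirrors the high-level structure
described in the abstract (imitate past trajectories, then seek novelty).
"""

import random


class ToyEnv:
    """Stand-in text-game environment with a tiny action space."""
    ACTIONS = ["go north", "go south", "take lamp", "open mailbox"]

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return "west of house"

    def step(self, action):
        self.t += 1
        obs = f"state after '{action}' at step {self.t}"
        reward = 1.0 if action == "open mailbox" else 0.0
        done = self.t >= 10
        return obs, reward, done, {}


class ExploitPolicy:
    """Imitates a promising past trajectory (here: simply replays its actions)."""
    def __init__(self, trajectory):
        self.actions = [a for (_, a, _) in trajectory]

    def act(self, obs, step):
        if step < len(self.actions):
            return self.actions[step]
        return random.choice(ToyEnv.ACTIONS)


class ExplorePolicy:
    """Curiosity-style stand-in: prefers actions not yet tried from this state."""
    def __init__(self):
        self.seen = set()

    def act(self, obs, step):
        untried = [a for a in ToyEnv.ACTIONS if (obs, a) not in self.seen]
        action = random.choice(untried or ToyEnv.ACTIONS)
        self.seen.add((obs, action))
        return action


def run_episode(env, past_trajectories, switch_step=4):
    """One episode: exploitation phase first, then switch to exploration."""
    # Pick a promising (highest-return) past trajectory to imitate.
    best = max(past_trajectories,
               key=lambda tr: sum(r for (_, _, r) in tr),
               default=[])
    exploit, explore = ExploitPolicy(best), ExplorePolicy()
    obs, done, step, trajectory = env.reset(), False, 0, []
    while not done:
        # Global decision (which part of the game to return to) is handled by
        # exploitation; local novelty-seeking takes over after the switch.
        policy = exploit if step < switch_step else explore
        action = policy.act(obs, step)
        obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward))
        step += 1
    past_trajectories.append(trajectory)  # candidate for future exploitation
    return trajectory


if __name__ == "__main__":
    trajectories = []
    for _ in range(3):
        run_episode(ToyEnv(), trajectories)
    print("collected", len(trajectories), "trajectories")
```

The key design point the sketch tries to convey is the explicit within-episode decomposition: the first phase deterministically revisits a known-good region of the game, and only then does the curiosity-driven policy take over to push into unseen states.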