In many problem settings, most notably in game playing, an agent receives a possibly delayed reward for its actions. Often, these rewards are handcrafted rather than naturally given. Even simple terminal-only rewards, such as +1 for a win and -1 for a loss, cannot be regarded as unbiased, since these values are chosen arbitrarily and the learner's behavior may change under different encodings. It is hard to argue for a particular reward design, yet an agent's performance often depends on it. In particular, in domains where states by nature admit only an ordinal ranking and no meaningful distance information between game state values is available, a numerical reward signal is necessarily biased. In this paper we take a look at MCTS, a popular algorithm for solving MDPs, highlight a recurring problem concerning its use of rewards, and show that an ordinal treatment of the rewards overcomes this problem. Using the General Video Game Playing framework, we show that our newly proposed ordinal MCTS algorithm, which is based on a novel bandit algorithm that we also introduce and evaluate against UCB, dominates other MCTS variants.