Temporal difference (TD) learning and its variants, such as multistage TD (MS-TD) learning and temporal coherence (TC) learning, have been successfully applied to 2048. These methods rely on the stochasticity of the 2048 environment for exploration. In this paper, we propose employing optimistic initialization (OI) to encourage exploration in 2048, and empirically show that it significantly improves learning quality. This approach optimistically initializes the feature weights to very large values. Since the weights are gradually reduced as states are visited, the agent is driven to explore states that are unvisited or rarely visited. Our experiments show that both TD and TC learning with OI significantly improve performance. As a result, the network size required to achieve the same performance is greatly reduced. With additional techniques such as expectimax search, multistage learning, and tile-downgrading, our design achieves state-of-the-art performance, namely an average score of 625,377 and a 72% rate of reaching 32768-tiles. In addition, over sufficiently many test episodes, 65536-tiles are reached at a rate of 0.02%.
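To make the OI mechanism concrete, below is a minimal sketch of optimistic initialization in a LUT-based n-tuple value function of the kind commonly used for 2048. The tuple shapes, learning rate, and the optimistic constant `V_INIT` are illustrative assumptions, not the paper's exact settings; the point is only that every feature weight starts large and is pulled down by TD updates as its states are visited.

```python
# Minimal sketch of optimistic initialization (OI) for an n-tuple network,
# assuming a LUT-based value function as commonly used for 2048.
# V_INIT, ALPHA, and N_TUPLES are hypothetical hyperparameters for
# illustration, not the paper's configuration.

V_INIT = 320000.0   # large optimistic starting value for every weight
ALPHA = 0.1         # TD learning rate
N_TUPLES = [(0, 1, 2, 3), (4, 5, 6, 7)]  # board-cell indices of each tuple
MAX_CODE = 16 ** 4  # 4 cells per tuple, each a tile exponent in 0..15

# Optimistic initialization: unvisited features look attractive
# until TD updates reduce their weights toward realistic values.
weights = [[V_INIT] * MAX_CODE for _ in N_TUPLES]

def features(board):
    """Map a board (list of 16 tile exponents) to one LUT index per tuple."""
    idxs = []
    for cells in N_TUPLES:
        code = 0
        for c in cells:
            code = code * 16 + board[c]
        idxs.append(code)
    return idxs

def value(board):
    """Value estimate: sum of the looked-up feature weights."""
    return sum(w[i] for w, i in zip(weights, features(board)))

def td_update(board, reward, next_board):
    """TD(0) update toward reward + V(next state); each visit lowers
    the optimistic weights, steering the agent toward unvisited states."""
    delta = ALPHA * (reward + value(next_board) - value(board)) / len(N_TUPLES)
    for w, i in zip(weights, features(board)):
        w[i] += delta

# Before any updates, every state carries the full optimistic bonus:
empty = [0] * 16
print(value(empty))  # 2 * V_INIT, so unexplored states dominate action choice
```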