In this paper, we consider the problem of adjusting the exploration rate when using value-of-information-based exploration. We do this by converting the value-of-information optimization into a problem of finding equilibria of a flow for a changing exploration rate. We then develop an efficient path-following scheme for converging to these equilibria and hence uncovering optimal action-selection policies. Under this scheme, the exploration rate is automatically adapted according to the agent's experiences. Global convergence is theoretically assured. We first evaluate our exploration-rate adaptation on the Nintendo GameBoy games Centipede and Millipede. We demonstrate aspects of the search process, such as its production of a hierarchy of state abstractions. We also show that our approach returns better policies in fewer episodes than conventional search strategies that rely on heuristic, annealing-based exploration-rate adjustments. We then illustrate that these trends hold for deep, value-of-information-based agents that learn to play ten simple games and over forty more complicated games for the Nintendo GameBoy system. Performance near or well above the level of human play is observed.
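To make the abstract's description concrete, the sketch below illustrates one plausible reading of a value-of-information-style soft policy and of the path-following idea: the policy is treated as a Boltzmann-like fixed point for a given exploration rate, and the exploration rate is then moved in small increments while re-solving for the equilibrium at each step. All function names, the specific objective, and the fixed-point iteration are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def voi_equilibrium(q, state_probs, beta, iters=200):
    """Equilibrium policy for one exploration rate `beta`: alternate between a
    Boltzmann-like conditional policy and its marginal action prior
    (a Blahut-Arimoto-style iteration). Purely illustrative."""
    n_states, n_actions = q.shape
    prior = np.full(n_actions, 1.0 / n_actions)
    for _ in range(iters):
        logits = np.log(prior)[None, :] + q / beta
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        policy = np.exp(logits)
        policy /= policy.sum(axis=1, keepdims=True)
        prior = state_probs @ policy                  # updated action marginal
    return policy, prior

def path_follow(q, state_probs, beta_start, beta_end, steps=50):
    """Illustrative path-following loop: change the exploration rate gradually
    and re-equilibrate the policy at each intermediate value, rather than
    jumping directly to the final exploration rate."""
    policy = None
    for beta in np.linspace(beta_start, beta_end, steps):
        policy, _ = voi_equilibrium(q, state_probs, beta)
    return policy
```

In this reading, the exploration rate plays the role of an inverse-temperature-like parameter, and the path over its values is what lets the agent transition smoothly from broad exploration to exploitation; the hierarchy of state abstractions mentioned in the abstract would emerge from the policies obtained at intermediate exploration rates.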