In this work we present a preliminary investigation of a novel algorithm called Dyna-T. In reinforcement learning (RL) a planning agent has its own representation of the environment as a model. To discover an optimal policy for interacting with the environment, the agent collects experience in a trial-and-error fashion. This experience can be used to learn a better model or to directly improve the value function and the policy. While these two processes are typically kept separate, Dyna-Q is a hybrid approach which, at each iteration, exploits the real experience to update both the model and the value function, while planning its actions using data simulated from the model. However, the planning process is computationally expensive and depends strongly on the dimensionality of the state-action space. We propose to build an Upper Confidence Tree (UCT) on the simulated experience and to search for the best action to select during the online learning process. We demonstrate the effectiveness of the proposed method in a set of preliminary tests on three testbed environments from OpenAI. In contrast to Dyna-Q, Dyna-T outperforms state-of-the-art RL agents in stochastic environments by adopting a more robust action selection strategy.
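To make the Dyna-style loop described above concrete, the sketch below shows a minimal tabular Dyna-Q agent in which a UCB1-style exploration bonus stands in for the full UCT search the abstract refers to. The toy chain environment, the hyperparameters, and the helper names (`step`, `ucb_action`) are illustrative assumptions, not details taken from the paper.

```python
import random
from collections import defaultdict
from math import log, sqrt

# Toy deterministic chain MDP (illustrative, not from the paper):
# states 0..4, actions 0 (left) / 1 (right), reward 1 on reaching state 4.
N_STATES, ACTIONS, GOAL = 5, (0, 1), 4

def step(s, a):
    ns = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return ns, float(ns == GOAL), ns == GOAL

Q = defaultdict(float)        # Q[(s, a)] value estimates
model = {}                    # model[(s, a)] = (next_state, reward)
counts = defaultdict(int)     # visit counts used by the UCB bonus
alpha, gamma, n_planning = 0.1, 0.95, 20

def ucb_action(s, c=1.0):
    """UCB1-style selection: value estimate plus an exploration bonus."""
    total = sum(counts[(s, a)] for a in ACTIONS) + 1
    return max(ACTIONS, key=lambda a: Q[(s, a)]
               + c * sqrt(log(total) / (counts[(s, a)] + 1)))

for episode in range(200):
    s, done = 0, False
    while not done:
        a = ucb_action(s)
        counts[(s, a)] += 1
        ns, r, done = step(s, a)
        # Direct RL: Q-learning update from the real transition.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(ns, b)] for b in ACTIONS) - Q[(s, a)])
        # Model learning: remember the observed transition.
        model[(s, a)] = (ns, r)
        # Planning: replay simulated transitions drawn from the learned model.
        for _ in range(n_planning):
            (ps, pa), (pns, pr) = random.choice(list(model.items()))
            Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(pns, b)] for b in ACTIONS) - Q[(ps, pa)])
        s = ns

# Greedy policy recovered from the learned value estimates.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})
```

The proposed Dyna-T replaces the single-step exploration bonus used here with a search tree grown over simulated experience, so the per-step action choice is informed by multi-step lookahead rather than by a bandit-style bonus alone.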