Playing an important role in Model-Based Reinforcement Learning (MBRL), environment models aim to predict future states based on past ones. Existing works usually ignore instantaneous dependence within the state; that is, they assume that the future state variables are conditionally independent of one another given the past states. However, instantaneous dependence is prevalent in many RL environments. For instance, in the stock market, instantaneous dependence can exist between two stocks because a fluctuation in one stock can affect the other faster than the temporal resolution at which prices are recorded, so the effect appears instantaneous in the data. In this paper, we prove that, with few exceptions, ignoring instantaneous dependence leads to suboptimal policy learning in MBRL. To address this suboptimality, we propose a simple plug-and-play method that enables existing MBRL algorithms to take instantaneous dependence into account. Through experiments on two benchmarks, we (1) confirm the existence of instantaneous dependence via visualization; (2) validate our theoretical finding that ignoring instantaneous dependence leads to suboptimal policies; and (3) verify that our method effectively enables reinforcement learning with instantaneous dependence and improves policy performance.
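To make the assumption concrete, here is a minimal sketch in our own notation (the factorizations below are illustrative, not taken from the paper). Writing the state as $s_t = (s_t^1, \dots, s_t^d)$, an environment model that ignores instantaneous dependence factorizes the transition as
$$
p(s_{t+1} \mid s_t, a_t) \;=\; \prod_{i=1}^{d} p\bigl(s_{t+1}^i \mid s_t, a_t\bigr),
$$
i.e., the components of $s_{t+1}$ are conditionally independent given the past. A model that accounts for instantaneous dependence instead lets components of $s_{t+1}$ depend on one another, for example via
$$
p(s_{t+1} \mid s_t, a_t) \;=\; \prod_{i=1}^{d} p\bigl(s_{t+1}^i \mid s_t, a_t, s_{t+1}^{\mathrm{pa}(i)}\bigr),
$$
where $\mathrm{pa}(i)$ denotes the (hypothetical) set of components that instantaneously influence $s_{t+1}^i$.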