Model-based reinforcement learning (MBRL) is believed to achieve higher sample efficiency than model-free reinforcement learning (MFRL). However, MBRL is plagued by the dynamics bottleneck dilemma: the phenomenon that the performance of the algorithm plateaus at a local optimum instead of continuing to improve as the number of interaction steps with the environment increases, meaning that more data does not bring better performance. In this paper, we show through theoretical analysis that the trajectory reward estimation error is the main cause of the dynamics bottleneck dilemma. We derive an upper bound on the trajectory reward estimation error and point out that increasing the agent's exploration ability is the key to reducing this error, thereby alleviating the dynamics bottleneck dilemma. Motivated by this, we propose a model-based control method combined with exploration, named MOdel-based Progressive Entropy-based Exploration (MOPE2). We conduct experiments on several complex continuous control benchmark tasks. The results verify that MOPE2 effectively alleviates the dynamics bottleneck dilemma and achieves higher sample efficiency than previous MBRL and MFRL algorithms.