Model-based reinforcement learning (MBRL) reduces the cost of real-environment sampling by generating synthetic trajectories (called rollouts) from a learned dynamics model. However, choosing the rollout length poses two dilemmas: (1) Longer rollouts better preserve on-policy training but amplify model bias, indicating the need for an intermediate horizon to mitigate distribution shift (i.e., the gap between on-policy and past off-policy samples). (2) A longer model rollout may also reduce value-estimation bias but raise the variance of policy gradients due to backpropagation through multiple steps, implying another intermediate horizon for stable gradient estimates. These two optimal horizons, however, may differ. To resolve this conflict, we propose Double Horizon Model-Based Policy Optimization (DHMBPO), which divides the rollout procedure into a long "distribution rollout" (DR) and a short "training rollout" (TR). The DR generates on-policy state samples to mitigate distribution shift. In contrast, the short TR leverages differentiable transitions to provide accurate value-gradient estimates with stable gradient updates, thereby requiring fewer updates and reducing overall runtime. We demonstrate that the double-horizon approach effectively balances distribution shift, model bias, and gradient instability, and surpasses existing MBRL methods on continuous-control benchmarks in terms of both sample efficiency and runtime.
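To make the split between the two rollouts concrete, the sketch below illustrates a double-horizon rollout step in PyTorch. It is a minimal illustration under assumed placeholder choices, not the authors' implementation: the network architectures, reward function, horizon lengths (H_DR, H_TR), and the bootstrapped critic are all illustrative assumptions.

```python
# Minimal sketch of the double-horizon rollout idea (illustrative, not the
# authors' code). All names, sizes, and horizons below are placeholder choices.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 4, 2
H_DR, H_TR = 10, 3          # long distribution rollout, short training rollout
GAMMA = 0.99

class DynamicsModel(nn.Module):
    """Learned, differentiable dynamics: s_{t+1} = f(s_t, a_t)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64),
                                 nn.Tanh(), nn.Linear(64, STATE_DIM))
    def forward(self, s, a):
        return s + self.net(torch.cat([s, a], dim=-1))

class Policy(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64),
                                 nn.Tanh(), nn.Linear(64, ACTION_DIM))
    def forward(self, s):
        return torch.tanh(self.net(s))

def reward_fn(s, a):
    # Placeholder reward; a real agent would use a known or learned reward.
    return -(s.pow(2).sum(-1) + 0.1 * a.pow(2).sum(-1))

model, policy = DynamicsModel(), Policy()
value_fn = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

# 1) Distribution rollout (long, no gradients): push replay-buffer states
#    toward the current policy's state distribution.
replay_states = torch.randn(256, STATE_DIM)   # stand-in for a replay buffer
with torch.no_grad():
    s = replay_states
    for _ in range(H_DR):
        s = model(s, policy(s))
start_states = s

# 2) Training rollout (short, differentiable): backpropagate the value
#    objective through the model for only H_TR steps.
s = start_states
ret = torch.zeros(s.shape[0])
for t in range(H_TR):
    a = policy(s)
    ret = ret + (GAMMA ** t) * reward_fn(s, a)
    s = model(s, a)
ret = ret + (GAMMA ** H_TR) * value_fn(s).squeeze(-1)   # bootstrap with a critic

loss = -ret.mean()
opt.zero_grad()
loss.backward()   # pathwise value gradients through the short training rollout
opt.step()
```

The key design point the sketch tries to convey is the separation of concerns: the long, gradient-free distribution rollout only shapes the start-state distribution, while gradients flow exclusively through the short training rollout, which keeps backpropagation-through-the-model shallow and the policy-gradient variance low.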