Model-Predictive Control (MPC) is a powerful tool for controlling complex, real-world systems that uses a model to make predictions about future behavior. For each state encountered, MPC solves an online optimization problem to choose a control action that will minimize future cost. This is a surprisingly effective strategy, but real-time performance requirements warrant the use of simple models. If the model is not sufficiently accurate, then the resulting controller can be biased, limiting performance. We present a framework for improving on MPC with model-free reinforcement learning (RL). The key insight is to view MPC as constructing a series of local Q-function approximations. We show that by using a parameter $\lambda$, similar to the trace decay parameter in TD($\lambda$), we can systematically trade off learned value estimates against the local Q-function approximations. We present a theoretical analysis that shows how error from inaccurate models in MPC and from value function estimation in RL can be balanced. We further propose an algorithm that changes $\lambda$ over time to reduce the dependence on MPC as our estimates of the value function improve, and we test the efficacy of our approach on challenging high-dimensional manipulation tasks with biased models in simulation. We demonstrate that our approach can obtain performance comparable to that of MPC with access to the true dynamics, even under severe model bias, and is more sample efficient than model-free RL.
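To make the $\lambda$-weighted trade-off concrete, the sketch below blends $h$-step returns from an MPC model rollout with a learned terminal value estimate, using TD($\lambda$)-style geometric weights. This is an illustrative simplification under assumed conventions, not the paper's exact estimator: `costs` and `values` are hypothetical per-step costs and learned value predictions along a single rollout, and under this weighting $\lambda = 0$ bootstraps immediately from the learned value while $\lambda = 1$ trusts the full model rollout.

```python
def lambda_blended_q(costs, values, gamma, lam):
    """Blend h-step model-rollout returns with a learned value function.

    costs:  per-step costs c_0 .. c_{H-1} from an MPC model rollout (assumed given)
    values: learned value estimates V(s_1) .. V(s_H) at the visited states
    gamma:  discount factor
    lam:    blending parameter, analogous to the trace decay in TD(lambda)
    """
    H = len(costs)

    def h_step_return(h):
        # Sum of the first h discounted costs, bootstrapped with the
        # learned value at the state reached after h model steps.
        ret = sum(gamma**t * costs[t] for t in range(h))
        return ret + gamma**h * values[h - 1]

    # TD(lambda)-style geometric weighting over horizons 1..H:
    # (1 - lam) * sum_{h=1}^{H-1} lam^{h-1} G_h  +  lam^{H-1} G_H
    blended = (1 - lam) * sum(lam**(h - 1) * h_step_return(h)
                              for h in range(1, H))
    blended += lam**(H - 1) * h_step_return(H)
    return blended
```

With $H = 2$, `costs = [1.0, 1.0]`, `values = [2.0, 3.0]`, and `gamma = 1.0`, setting `lam = 0` returns the one-step bootstrap `3.0`, `lam = 1` returns the full rollout return `5.0`, and intermediate values interpolate between the two sources of error the analysis balances.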