This paper presents a novel algorithm for the continuous control of dynamical systems that combines Trajectory Optimization (TO) and Reinforcement Learning (RL) in a single framework. The algorithm is motivated by the two main limitations of TO and RL when applied to continuous nonlinear systems to minimize a non-convex cost function. Specifically, TO can get stuck in poor local minima when the search is not initialized close to a "good" minimum. On the other hand, when dealing with continuous state and control spaces, the RL training process may be excessively long and strongly dependent on the exploration strategy. Thus, our algorithm learns a "good" control policy via TO-guided RL policy search that, when used as an initial-guess provider for TO, makes the trajectory optimization process less prone to converging to poor local optima. Our method is validated on several reaching problems featuring non-convex obstacle avoidance with different dynamical systems, including a car model with a 6D state and a 3-joint planar manipulator. Our results show the strong capability of CACTO (Continuous Actor-Critic with Trajectory Optimization) to escape local minima, while being more computationally efficient than the Deep Deterministic Policy Gradient (DDPG) and Proximal Policy Optimization (PPO) RL algorithms.
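To make the warm-starting idea concrete, the following minimal Python sketch (not the authors' implementation) contrasts a trajectory optimization solved from a cold start with one warm-started by a policy-provided initial guess, on a toy 1D reaching task with a non-convex obstacle penalty. The dynamics, cost weights, and the hand-coded stand-in policy are illustrative assumptions, and a generic NLP solver stands in for the TO solver.

```python
# Minimal sketch of policy warm-starting for trajectory optimization (illustrative
# assumptions only; not the authors' method or code).
import numpy as np
from scipy.optimize import minimize

T, dt = 30, 0.1            # horizon length and time step
x_goal, x_obs = 2.0, 1.0   # goal position and obstacle center (toy values)

def rollout(u, x0=0.0):
    """Integrate a 1D single-integrator: x_{k+1} = x_k + dt * u_k."""
    xs = [x0]
    for uk in u:
        xs.append(xs[-1] + dt * uk)
    return np.array(xs)

def cost(u):
    """Reach the goal, avoid the obstacle (non-convex bump), keep effort small."""
    xs = rollout(u)
    goal_term = 10.0 * (xs[-1] - x_goal) ** 2
    obstacle_term = np.sum(2.0 * np.exp(-50.0 * (xs - x_obs) ** 2))
    effort_term = 1e-2 * np.sum(u ** 2)
    return goal_term + obstacle_term + effort_term

def policy_guess():
    """Stand-in for a learned policy: push steadily toward the goal."""
    return np.full(T, x_goal / (T * dt))

bounds = [(-1.0, 1.0)] * T

# Cold start: zero controls. Depending on the cost landscape, the solver can
# settle in a worse local minimum behind the obstacle bump.
cold = minimize(cost, np.zeros(T), method="L-BFGS-B", bounds=bounds)

# Warm start: the (stand-in) policy provides the initial guess, as CACTO's
# learned policy would for the TO solver.
warm = minimize(cost, policy_guess(), method="L-BFGS-B", bounds=bounds)

print(f"cost from cold start:        {cold.fun:.3f}")
print(f"cost from policy warm start: {warm.fun:.3f}")
```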