The empirical success of derivative-free methods in reinforcement learning for planning through contact seems at odds with the perceived fragility of classical gradient-based optimization methods in these domains. What is causing this gap, and how might we use the answer to improve gradient-based methods? We believe a stochastic formulation of dynamics is one crucial ingredient. We use tools from randomized smoothing to analyze sampling-based approximations of the gradient, and formalize such approximations through the gradient bundle. We show that using the gradient bundle in lieu of the gradient mitigates the fast-changing gradients of non-smooth contact dynamics modeled by implicit time-stepping or penalty methods. Finally, we apply the gradient bundle to optimal control using iLQR, introducing a novel algorithm that improves convergence compared to using exact gradients. Combining our algorithm with a convex implicit time-stepping formulation of contact, we show that we can tractably tackle planning-through-contact problems in manipulation.
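As a rough illustration of the idea (not the paper's implementation), the sketch below estimates a first-order gradient bundle by averaging exact gradients of a toy non-smooth map at Gaussian-perturbed inputs; the function `step`, the sample count, and the noise scale `sigma` are hypothetical stand-ins for the contact dynamics and parameters used in the paper.

```python
# Minimal sketch of a first-order "gradient bundle" via randomized smoothing:
# estimate d/dx E_w[f(x + w)], w ~ N(0, sigma^2), by Monte-Carlo averaging
# of exact gradients at perturbed inputs. `step` is a toy non-smooth map
# standing in for contact dynamics (hypothetical example).
import jax
import jax.numpy as jnp

def step(x):
    # Toy non-smooth "contact" map: a hard wall at x = 0.
    return jnp.maximum(x, 0.0)

def gradient_bundle(f, x, sigma=0.1, num_samples=100, key=jax.random.PRNGKey(0)):
    """Monte-Carlo estimate of the gradient of the smoothed objective."""
    w = sigma * jax.random.normal(key, (num_samples,) + x.shape)
    grads = jax.vmap(jax.grad(f))(x + w)   # exact gradient at each sampled input
    return grads.mean(axis=0)              # bundled (smoothed) gradient

x = jnp.array(0.0)
print(jax.grad(step)(x))          # exact gradient: jumps from 0 to 1 across x = 0
print(gradient_bundle(step, x))   # bundled gradient: roughly 0.5, varies smoothly in x
```

The averaged quantity is the gradient of a Gaussian-smoothed version of the dynamics, so it changes gradually near the contact discontinuity instead of switching abruptly, which is what makes it better suited as a descent direction inside a gradient-based planner such as iLQR.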