Classical value iteration approaches are not applicable to environments with continuous states and actions. For such environments, the states and actions are usually discretized, which leads to an exponential increase in computational complexity. In this paper, we propose continuous fitted value iteration (cFVI). This algorithm enables dynamic programming for continuous states and actions with a known dynamics model. Leveraging the continuous-time formulation, the optimal policy can be derived for non-linear control-affine dynamics. This closed-form solution enables the efficient extension of value iteration to continuous environments. We show in non-linear control experiments that the dynamic programming solution obtains the same quantitative performance as deep reinforcement learning methods in simulation but excels when transferred to the physical system. The policy obtained by cFVI is more robust to changes in the dynamics despite using only a deterministic model and without explicitly incorporating robustness in the optimization. Videos of the physical system are available at \url{https://sites.google.com/view/value-iteration}.
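As a minimal illustration of the closed-form policy referenced above, consider control-affine dynamics with a quadratic action cost. The notation ($a(x)$, $B(x)$, $q(x)$, $R$, $V$) and the quadratic cost are assumptions for this sketch and are not taken verbatim from the paper; discounting is omitted for brevity, and the paper's general formulation may use a broader class of action costs.
\begin{align}
  0 &= \min_u \Big[\, q(x) + \tfrac{1}{2} u^\top R u
        + \nabla_x V(x)^\top \big( a(x) + B(x)\, u \big) \Big], \\
  u^*(x) &= -R^{-1} B(x)^\top \nabla_x V(x).
\end{align}
Because the minimization over $u$ can be solved analytically, each value-iteration update only requires evaluating $\nabla_x V$, which is what makes the extension to continuous actions tractable.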