We consider the problem of Reinforcement Learning for nonlinear stochastic dynamical systems. We show that, in the RL setting, there is an inherent ``Curse of Variance'' in addition to Bellman's infamous ``Curse of Dimensionality'': in particular, we show that the variance in the solution grows factorial-exponentially in the order of the approximation. A fundamental consequence is that, in order to control this explosive variance growth and thereby ensure accuracy, the search in RL must be restricted to ``local'' feedback solutions. We further show that the deterministic optimal control has a perturbation structure, in that the higher-order terms do not affect the calculation of the lower-order terms; this structure can be exploited in RL to obtain accurate local solutions.
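As an illustrative sketch of the perturbation structure referred to above (the small parameter $\epsilon$ and the symbols $J_k$, $u_k$ are our notation, not taken from the abstract), one may expand the optimal cost-to-go and the optimal feedback in a perturbation series:
\begin{align*}
  J^{\epsilon}(x) &= J_0(x) + \epsilon\, J_1(x) + \epsilon^2 J_2(x) + \cdots, \\
  u^{\epsilon}(x) &= u_0(x) + \epsilon\, u_1(x) + \epsilon^2 u_2(x) + \cdots,
\end{align*}
where, under the claimed structure, each term of order $k$ is determined only by terms of order at most $k$; truncating the series to a low-order ``local'' feedback law therefore leaves the already-computed lower-order terms unchanged.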