Most modern reinforcement learning algorithms optimize a cumulative single-step cost along a trajectory. The resulting motions are often 'unnatural', exhibiting, for example, sudden accelerations that waste energy and lack predictability. In this work, we present a novel paradigm of controlling nonlinear systems via the minimization of the Koopman spectrum cost: a cost over the Koopman operator of the controlled dynamics. This induces a broader class of dynamical behaviors that evolve over stable manifolds, such as nonlinear oscillators, closed loops, and smooth movements. We demonstrate that some dynamics realizations that are not possible with a cumulative cost are feasible in this paradigm. Moreover, we present a provably efficient online learning algorithm for our problem that enjoys a sublinear regret bound under some structural assumptions.
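To make the central object concrete, the following is a minimal sketch of estimating a (finite-dimensional approximation of a) Koopman operator from trajectory snapshots via least squares, and then evaluating one possible spectrum cost on its eigenvalues. The identity observables, the synthetic dynamics, and the particular cost (penalizing eigenvalues outside the unit circle) are all illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

# Synthetic snapshot pairs (x_t, x_{t+1}) from a stable linear system;
# in general the states would be lifted by a dictionary of observables,
# but here we use identity observables for brevity (an assumption).
rng = np.random.default_rng(0)
A = 0.9 * np.array([[0.0, 1.0, 0.0],
                    [-1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0]])   # damped rotation: |eigenvalues| = 0.9
X = rng.standard_normal((50, 3))         # states x_t (rows)
Y = X @ A.T                              # states x_{t+1}

# DMD-style least-squares estimate of the Koopman operator K: Y ≈ X K^T
K = np.linalg.lstsq(X, Y, rcond=None)[0].T

# One illustrative spectrum cost: penalize eigenvalues outside the unit
# circle (i.e., unstable modes). The cost functionals considered in the
# paper may differ.
eigvals = np.linalg.eigvals(K)
spectrum_cost = float(np.sum(np.maximum(np.abs(eigvals) - 1.0, 0.0)))
```

Because the snapshots here come from an exactly linear, stable system, the estimated eigenvalues have modulus 0.9 and this particular spectrum cost is zero; for genuinely nonlinear dynamics the quality of the estimate depends on the chosen observables.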