This paper proposes a new regularization technique for reinforcement learning (RL) that makes the policy and value functions smooth and stable. RL is known for the instability of its learning process and the sensitivity of the acquired policy to noise. Several methods have been proposed to resolve these problems, and in summary, the smoothness of the policy and value functions learned in RL is the main factor underlying them. However, if these functions are made extremely smooth, their expressiveness is lost, and the global optimal solution can no longer be obtained. This paper therefore considers RL under a local Lipschitz continuity constraint, so-called L2C2. By designing the spatio-temporally local compact space for L2C2 from the state transition at each time step, moderate smoothness can be achieved without loss of expressiveness. Numerical simulations with noise verified that the proposed L2C2 outperforms previous methods in task performance while smoothing the robot actions generated from the learned policy.
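To make the idea of a local Lipschitz constraint concrete, the following is a minimal sketch (not the authors' implementation) of a regularization penalty on the local Lipschitz ratio of a function over one observed state transition; the function `local_lipschitz_penalty` and the toy value function are illustrative assumptions.

```python
import numpy as np

def local_lipschitz_penalty(f, s, s_next, eps=1e-8):
    """Ratio ||f(s') - f(s)|| / ||s' - s|| over one state transition.

    Penalizing this ratio encourages f to satisfy a local Lipschitz
    condition |f(s) - f(s')| <= K |s - s'| only where transitions
    actually occur, rather than enforcing global smoothness.
    """
    df = np.linalg.norm(f(s_next) - f(s))
    ds = np.linalg.norm(s_next - s)
    return df / (ds + eps)

# Toy value function: linear, hence globally Lipschitz with constant ||w||,
# so the local ratio along any transition is bounded by ||w||.
w = np.array([0.5, -0.25])
f = lambda s: np.array([w @ s])

s = np.array([1.0, 2.0])
s_next = np.array([1.1, 1.9])
penalty = local_lipschitz_penalty(f, s, s_next)
```

In practice such a penalty would be added to the RL loss with a weighting coefficient, trading off smoothness against expressiveness as the abstract describes.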