The reinforcement learning problem of finding a control policy that optimizes a minimum-time objective for the Mountain Car environment is considered. Specifically, a class of parameterized nonlinear feedback policies is optimized to drive the car to the top of the highest mountain peak in minimum time. The optimization is carried out using quasi-Stochastic Gradient Descent (qSGD) methods. In seeking the optimal minimum-time policy, a new parameterized policy approach is considered that learns an optimal policy parameter for each region of the state space, rather than relying on a single macroscopic policy parameter for the entire state space. This partitioned parameterized policy approach is shown to outperform the uniform parameterized policy approach and to generalize better than prior methods, in which the Mountain Car became trapped in circular trajectories in the state space.
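The core idea of qSGD is to replace random exploration noise with a deterministic probing signal (e.g., a sinusoid) when estimating the gradient of the objective with respect to the policy parameter. The following is a minimal sketch of that update rule on a stand-in objective; the function names, gains, and the quadratic cost are illustrative assumptions, not the paper's actual Mountain Car time-to-goal objective.

```python
import math

def objective(theta):
    # Stand-in for the minimum-time cost J(theta) of a parameterized
    # policy; here a simple quadratic with its minimum at theta = 2.0.
    return (theta - 2.0) ** 2

def qsgd(theta0, steps=5000, eps=0.1, omega=1.7):
    """Quasi-SGD sketch: a deterministic sinusoidal probing signal
    xi_n = cos(omega * n) replaces random perturbations, and a
    two-sided finite difference along xi_n estimates the gradient."""
    theta = theta0
    for n in range(1, steps + 1):
        a_n = 1.0 / n                 # vanishing step-size sequence
        xi = math.cos(omega * n)      # deterministic probing signal
        # Gradient estimate from two perturbed objective evaluations
        grad_est = xi * (objective(theta + eps * xi)
                         - objective(theta - eps * xi)) / (2.0 * eps)
        theta -= a_n * grad_est
    return theta
```

On this toy objective the iterate settles near the minimizer (theta close to 2.0); in the paper's setting the scalar `theta` would instead be the parameter of the nonlinear feedback policy, evaluated by simulating the car to the goal.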