In this paper, we present a policy gradient method that avoids exploratory noise injection and performs policy search over the deterministic landscape. By avoiding noise injection, all sources of estimation variance can be eliminated in systems with deterministic dynamics (up to the initial state distribution). Since deterministic policy regularization is impossible using traditional non-metric measures such as the KL divergence, we derive a Wasserstein-based quadratic model for our purposes. We state conditions on the system model under which a monotonic policy improvement guarantee can be established, propose a surrogate function for policy gradient estimation, and show that exact advantage estimates can be computed when both the state transition model and the policy are deterministic. Finally, we describe two novel robotic control environments -- one with non-local rewards in the frequency domain and the other with a long horizon (8000 time steps) -- on which our policy gradient method (TDPO) significantly outperforms existing methods (PPO, TRPO, DDPG, and TD3). Our implementation, with all experimental settings, is available at https://github.com/ehsansaleh/code_tdpo
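As a point of intuition for the exact-advantage claim, the sketch below (not the paper's TDPO implementation) shows how, with a deterministic transition model and a deterministic policy, advantages can be computed by rolling the system forward with no Monte-Carlo sampling variance. The names f, pi, reward, gamma, and the truncation horizon H are illustrative assumptions, not quantities defined in the paper.

```python
# Minimal sketch, assuming a deterministic transition model f(s, a),
# a deterministic policy pi(s), and a reward function reward(s, a).
# With no stochasticity, every rollout is exactly reproducible, so the
# advantage A(s, a) = Q(s, a) - V(s) is exact up to horizon truncation.

def rollout_return(f, reward, pi, s0, gamma=0.99, H=200):
    """Exact discounted return of the deterministic policy pi from state s0."""
    ret, s = 0.0, s0
    for t in range(H):
        a = pi(s)
        ret += (gamma ** t) * reward(s, a)
        s = f(s, a)
    return ret

def exact_advantage(f, reward, pi, s, a, gamma=0.99, H=200):
    """A(s, a) computed without any Monte-Carlo variance."""
    # Q(s, a): take action a first, then follow pi in the deterministic system.
    q_sa = reward(s, a) + gamma * rollout_return(f, reward, pi, f(s, a), gamma, H - 1)
    # V(s): follow pi from s directly.
    v_s = rollout_return(f, reward, pi, s, gamma, H)
    return q_sa - v_s
```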