One of the major difficulties of reinforcement learning is learning from {\em off-policy} samples, which are collected by a policy (the behavior policy) different from the one the algorithm evaluates (the target policy). Off-policy learning must correct the distribution of the samples from the behavior policy towards that of the target policy, which is commonly done with importance sampling. Unfortunately, importance sampling has an inherent high-variance issue, which leads to poor gradient estimation in policy gradient methods. We focus on an off-policy Actor-Critic architecture and propose a novel method, called Preconditioned Proximal Policy Optimization (P3O), which controls the high variance of importance sampling by applying a preconditioner to the Conservative Policy Iteration (CPI) objective. {\em This preconditioning uses the sigmoid function in a special way: when there is no policy change, the gradient is maximal, and hence the policy gradient drives a large parameter update for efficient exploration of the parameter space}. This is a novel exploration method that has not been studied before; existing exploration methods are instead based on the novelty of states and actions. We compare with several best-performing algorithms on both discrete and continuous tasks, and the results confirm that {\em P3O is more off-policy than PPO} according to the "off-policyness" measured by the DEON metric, and that P3O explores a larger policy space than PPO. The results also show that P3O maximizes the CPI objective better than PPO during training.
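The abstract does not state the P3O objective itself, so the following is only a minimal sketch of the idea it describes, under stated assumptions. The CPI surrogate (importance ratio times advantage) is standard; the sigmoid-scaled variant, its temperature `tau`, and the `4 / tau` scale factor are illustrative assumptions chosen so that the slope with respect to the ratio peaks at 1 when the ratio is 1 (no policy change). All function names are hypothetical.

```python
import math


def importance_ratio(logp_target, logp_behavior):
    """Importance ratio pi_target(a|s) / pi_behavior(a|s) from log-probabilities."""
    return math.exp(logp_target - logp_behavior)


def cpi_surrogate(ratio, advantage):
    """Per-sample CPI objective: ratio * advantage. Unbounded in the ratio."""
    return ratio * advantage


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_scaled_surrogate(ratio, advantage, tau=4.0):
    """ASSUMED form, not taken from the abstract:
    sigmoid(tau * (ratio - 1)) * (4 / tau) * advantage.
    Bounded in the ratio, unlike the plain CPI surrogate."""
    return sigmoid(tau * (ratio - 1.0)) * (4.0 / tau) * advantage


def sigmoid_scaled_slope(ratio, tau=4.0):
    """Derivative of the assumed surrogate w.r.t. the ratio, for unit advantage.
    Equals 4 * s * (1 - s), which is largest (exactly 1) at ratio == 1."""
    s = sigmoid(tau * (ratio - 1.0))
    return 4.0 * s * (1.0 - s)


if __name__ == "__main__":
    # Compare the slope and the surrogate value as the ratio moves away from 1.
    for r in (0.25, 0.5, 1.0, 2.0, 5.0):
        print(r, sigmoid_scaled_slope(r), sigmoid_scaled_surrogate(r, 1.0))
```

Under these assumptions, the surrogate's slope matches that of the CPI objective when the two policies agree and shrinks as the ratio moves away from 1, so a single sample with a very large ratio cannot dominate the gradient estimate.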