We propose two policy gradient algorithms for solving the problem of control in an off-policy reinforcement learning (RL) context. Both algorithms incorporate a smoothed functional (SF) based gradient estimation scheme. The first algorithm is a straightforward combination of importance sampling-based off-policy evaluation with SF-based gradient estimation. The second algorithm, inspired by the stochastic variance-reduced gradient (SVRG) algorithm, incorporates variance reduction in the update iteration. For both algorithms, we derive non-asymptotic bounds that establish convergence to an approximate stationary point. From these results, we infer that the first algorithm converges at a rate that is comparable to the well-known REINFORCE algorithm in an off-policy RL context, while the second algorithm exhibits an improved rate of convergence.
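To make the smoothed functional (SF) idea concrete, here is a minimal sketch of a one-sided SF gradient estimator for a generic objective. The function name `sf_gradient`, the choice of perturbations drawn uniformly from the unit sphere, and all parameter values are illustrative assumptions, not the paper's exact construction; the paper applies such an estimator to importance-sampling-based off-policy value estimates rather than to an exact objective.

```python
import numpy as np

def sf_gradient(J, theta, beta=0.05, n_samples=1000, seed=None):
    """One-sided smoothed functional gradient estimate (illustrative sketch).

    Perturbs theta by beta * u, with u uniform on the unit sphere, and
    averages (d / beta) * (J(theta + beta*u) - J(theta)) * u over samples.
    For smooth J this is a nearly unbiased estimate of grad J(theta).
    """
    rng = np.random.default_rng(seed)
    d = theta.shape[0]
    g = np.zeros(d)
    for _ in range(n_samples):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)  # uniform direction on the unit sphere
        g += (d / beta) * (J(theta + beta * u) - J(theta)) * u
    return g / n_samples
```

In an off-policy RL instantiation, `J` would be replaced by an importance-sampling estimate of the target policy's value computed from behavior-policy trajectories, which is what makes the two function evaluations per perturbation feasible without fresh on-policy data.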