We consider the problem of finding the optimal value of n in the n-step temporal difference (TD) algorithm. We find the optimal n using simultaneous perturbation stochastic approximation (SPSA), a model-free optimization technique. We adapt a one-simulation SPSA procedure, originally designed for continuous optimization, to the discrete optimization framework, and incorporate a cyclic perturbation sequence. We prove the convergence of our proposed algorithm, SDPSA, and show that it finds the optimal value of n in n-step TD. Through experiments, we show that SDPSA attains the optimal value of n from an arbitrary initial value of n.
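To make the idea of a one-simulation SPSA update with a cyclic perturbation sequence concrete, the following is a minimal sketch and not the SDPSA algorithm itself. It assumes a hypothetical deterministic cost `toy_cost` standing in for the (normally noisy) error of n-step TD, a perturbation that simply alternates between +1 and -1, a constant step size, and rounding as the map from the continuous iterate to an integer n. All names and parameter values are illustrative assumptions.

```python
from typing import Callable


def toy_cost(n: int) -> float:
    """Hypothetical stand-in for the error of n-step TD; minimized at n = 5."""
    return float((n - 5) ** 2)


def tune_n(cost: Callable[[int], float], n_init: int, n_max: int = 20,
           step: float = 0.01, iters: int = 2000) -> int:
    """Sketch of a one-measurement SPSA search over the integers 1..n_max."""
    theta = float(n_init)  # continuous surrogate for the integer parameter n
    for k in range(iters):
        # Cyclic (deterministic) perturbation: +1, -1, +1, -1, ...
        delta = 1.0 if k % 2 == 0 else -1.0
        # Map the perturbed iterate to a feasible integer (assumed: round and clip).
        n_perturbed = int(min(max(round(theta + delta), 1), n_max))
        # One-simulation estimate: a single cost measurement divided by the
        # perturbation; the bias of consecutive estimates cancels over a cycle.
        grad_est = cost(n_perturbed) / delta
        # Gradient step, projected back onto [1, n_max].
        theta = min(max(theta - step * grad_est, 1.0), float(n_max))
    return int(round(theta))
```

With this toy cost, calls such as `tune_n(toy_cost, n_init=15)` or `tune_n(toy_cost, n_init=1)` should settle at or next to the minimizer n = 5. In an actual experiment the cost would be a noisy estimate obtained by running n-step TD, and diminishing step sizes would typically replace the constant one used here.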