We consider the problem of finding the optimal value of n in the n-step temporal difference (TD) algorithm. We find the optimal n using simultaneous perturbation stochastic approximation (SPSA), a model-free optimization technique. We adapt a one-simulation SPSA procedure, originally designed for continuous optimization, to the discrete optimization setting and incorporate a cyclic perturbation sequence. We prove the convergence of our proposed algorithm, SDPSA, and show that it finds the optimal value of n in n-step TD. Through experiments, we show that SDPSA attains the optimal value of n from an arbitrary initial value of n.
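To make the role of n concrete, the following is a minimal sketch of the standard textbook n-step return and the corresponding TD update for a single recorded episode. It is not the authors' code and does not reproduce SDPSA; it only illustrates the quantity that the choice of n controls. The names `n_step_return` and `td_update`, the trajectory layout, and the example numbers are assumptions made for illustration.

```python
from typing import List, Sequence


def n_step_return(rewards: Sequence[float], values: Sequence[float],
                  t: int, n: int, gamma: float) -> float:
    """Standard n-step return G_{t:t+n} along one recorded episode.

    Assumed layout (hypothetical, for illustration only):
      rewards[k] is the reward received on the transition from step k to k+1,
      values[k]  is the current value estimate of the state visited at step k,
      len(values) == len(rewards) + 1 and the terminal value values[-1] is 0.
    """
    T = len(rewards)                 # episode length
    end = min(t + n, T)              # truncate at the end of the episode
    g = 0.0
    for k in range(t, end):          # discounted sum of up to n rewards
        g += gamma ** (k - t) * rewards[k]
    if end < T:                      # bootstrap only if the episode has not ended
        g += gamma ** n * values[end]
    return g


def td_update(values: List[float], rewards: Sequence[float],
              t: int, n: int, gamma: float, alpha: float) -> None:
    """In-place n-step TD update of the value estimate at step t."""
    target = n_step_return(rewards, values, t, n, gamma)
    values[t] += alpha * (target - values[t])


# Example episode: three transitions, reward only on the last one.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]        # last entry is the terminal state
for n in (1, 2, 3):
    print(n, n_step_return(rewards, values, t=0, n=n, gamma=0.9))
```

Small n leans on the bootstrapped estimate `values[t + n]`, while large n leans on the sampled rewards, which is why the best n depends on the problem and is worth tuning.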