We consider the problem of finding the optimal value of n in the n-step temporal difference (TD) learning algorithm. To find it, we use a model-free optimization technique based on a one-simulation simultaneous perturbation stochastic approximation (SPSA) procedure, which we adapt to the discrete optimization setting through a random projection approach. We prove the convergence of the proposed algorithm, SDPSA, using a differential inclusions approach and show that it finds the optimal value of n in n-step TD. Experiments show that SDPSA reaches the optimal value of n from arbitrary initial values.
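To make the idea concrete, the following is a minimal sketch of a one-simulation SPSA update combined with a random projection onto the integers. It is not the authors' SDPSA implementation: the function names (`random_projection`, `sdpsa_sketch`), the step-size schedule `1/k`, the fixed perturbation size `delta`, the clipping to `[n_min, n_max]`, and the `noisy_cost` callback are all assumptions made for illustration. In the n-step TD setting, `noisy_cost(n, rng)` would stand for one noisy evaluation of the performance of n-step TD run with that n.

```python
import math
import random


def random_projection(x, rng):
    """Round a continuous x to a neighbouring integer.

    ceil(x) is chosen with probability x - floor(x), floor(x) otherwise,
    so the projected value equals x in expectation (assumed scheme).
    """
    lo = math.floor(x)
    frac = x - lo
    return lo + 1 if rng.random() < frac else lo


def sdpsa_sketch(noisy_cost, n_min, n_max, x0, iters=2000, delta=1.0, seed=0):
    """Illustrative one-simulation SPSA over an integer parameter n.

    noisy_cost(n, rng) is a hypothetical callback returning one noisy
    cost sample for the integer n.
    """
    rng = random.Random(seed)
    x = float(x0)  # continuous surrogate of the discrete parameter
    for k in range(1, iters + 1):
        step = 1.0 / k  # assumed diminishing step size
        d = rng.choice((-1.0, 1.0))  # Bernoulli +/-1 perturbation
        # Perturb, keep inside the feasible interval, then project to an integer.
        x_pert = min(max(x + delta * d, n_min), n_max)
        n = random_projection(x_pert, rng)
        # One-simulation estimate: a single cost sample per iteration.
        # Dividing by d equals multiplying by d because d is +/-1.
        grad_est = noisy_cost(n, rng) * d / delta
        x = min(max(x - step * grad_est, n_min), n_max)
    return int(round(x))
```

A call could look like `sdpsa_sketch(lambda n, rng: (n - 5) ** 2 / 10.0 + rng.gauss(0, 0.1), 1, 16, x0=12)`, where the quadratic is a toy stand-in for the n-step TD error. One-simulation estimates are noisy, so the step size and `delta` would typically need tuning in practice.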