We consider a policy gradient algorithm applied to a finite-arm bandit problem with Bernoulli rewards. We allow learning rates to depend on the current state of the algorithm, rather than using a deterministic, time-decreasing learning rate. The state of the algorithm forms a Markov chain on the probability simplex. We apply Foster-Lyapunov techniques to analyse the stability of this Markov chain. We prove that, if learning rates are well chosen, the policy gradient algorithm is a transient Markov chain and the state of the chain converges to the optimal arm with logarithmic or poly-logarithmic regret.
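The sketch below illustrates the kind of algorithm described above: a softmax policy gradient update on a Bernoulli bandit, where the step size depends on the current point on the simplex rather than on time. The particular learning-rate rule (shrinking as the policy concentrates) and the parametrization are illustrative assumptions, not the paper's specific choices.

```python
import numpy as np

def policy_gradient_bandit(mu, steps=100_000, seed=0):
    """Softmax policy gradient on a Bernoulli bandit with a
    state-dependent learning rate (illustrative choice only)."""
    rng = np.random.default_rng(seed)
    K = len(mu)
    theta = np.zeros(K)              # softmax parameters
    best = max(mu)
    regret = 0.0
    for _ in range(steps):
        p = np.exp(theta - theta.max())
        p /= p.sum()                 # current state: a point on the simplex
        a = rng.choice(K, p=p)       # sample an arm
        r = rng.binomial(1, mu[a])   # Bernoulli reward
        # State-dependent learning rate: shrink as the policy concentrates.
        # This is a hypothetical rule for illustration, not the paper's.
        lr = 1.0 - p.max()
        grad = -p.copy()
        grad[a] += 1.0               # REINFORCE gradient of log p[a]
        theta += lr * r * grad
        regret += best - mu[a]
    return p, regret

probs, total_regret = policy_gradient_bandit([0.3, 0.5, 0.7])
print(probs, total_regret)
```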