Bandit and reinforcement learning (RL) problems can often be framed as optimization problems where the goal is to maximize average performance while having access only to stochastic estimates of the true gradient. Classical stochastic optimization theory predicts that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates. In this paper we demonstrate that this is not the case for bandit and RL problems. So that our analysis can be interpreted in light of multi-step MDPs, we focus on techniques derived from stochastic optimization principles (e.g., natural policy gradient and EXP3), and we show that some standard assumptions from optimization theory are violated in these problems. We present theoretical results showing that, at least for bandit problems, curvature and noise are not sufficient to explain the learning dynamics, and that seemingly innocuous choices such as the baseline can determine whether an algorithm converges. These theoretical findings match our empirical evaluation, which we extend to multi-state MDPs.
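As a concrete illustration of the claim that the baseline alone can change whether policy gradient converges, the following is a minimal sketch that is not taken from the paper: a two-armed bandit with deterministic rewards, a softmax policy trained by constant-step-size REINFORCE, and a fixed scalar baseline. The setup, the `run` helper, and the specific baseline values are illustrative assumptions; baselines below the minimum reward make every sampled arm look good and can cause the policy to commit to the suboptimal arm, while a baseline between the two rewards reliably drives the policy toward the optimal arm.

```python
# Illustrative sketch (assumed setup, not the paper's experiments):
# two-armed bandit, softmax policy, stochastic policy-gradient updates
# with a fixed scalar baseline b.
import numpy as np

def run(baseline, steps=5000, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    rewards = np.array([1.0, 0.0])   # arm 0 is optimal; rewards are deterministic
    theta = np.zeros(2)              # softmax logits
    for _ in range(steps):
        probs = np.exp(theta - theta.max())
        probs /= probs.sum()
        a = rng.choice(2, p=probs)
        r = rewards[a]
        # REINFORCE with baseline: grad log pi(a) = e_a - probs
        grad_logp = -probs
        grad_logp[a] += 1.0
        theta += lr * (r - baseline) * grad_logp
    return probs[0]                  # final probability of the optimal arm

if __name__ == "__main__":
    # Average over seeds: b = -1 can get stuck on the wrong arm ("committal"
    # behavior), whereas b = 0.5 sits between the rewards and converges.
    for b in (-1.0, 0.0, 0.5):
        finals = [run(b, seed=s) for s in range(20)]
        print(f"baseline={b:+.1f}  mean P(optimal arm) = {np.mean(finals):.3f}")
```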