Neural replicator dynamics (NeuRD) is an alternative to the foundational softmax policy gradient (SPG) algorithm, motivated by online learning and evolutionary game theory. The NeuRD expected update is designed to be nearly identical to that of SPG; however, we show that the Monte Carlo updates differ in a substantial way: the importance correction accounting for a sampled action is nullified in the SPG update, but not in the NeuRD update. Naturally, this causes the NeuRD update to have higher variance than its SPG counterpart. Building on implicit exploration algorithms in the adversarial bandit setting, we introduce capped implicit exploration (CIX) estimates that allow us to construct NeuRD-CIX, which interpolates between this aspect of NeuRD and SPG. We show how CIX estimates can be used in a black-box reduction to construct bandit algorithms whose regret bounds hold with high probability, and the benefits this entails for NeuRD-CIX in sequential decision-making settings. Our analysis reveals a bias-variance tradeoff between SPG and NeuRD, and shows how theory predicts that NeuRD-CIX will perform well more consistently than NeuRD while retaining NeuRD's advantages over SPG in non-stationary environments.
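The following is a minimal sketch of the update-rule difference the abstract describes, for a single-state bandit with a softmax policy over logits. The function names, the reward range, and the use of the standard implicit-exploration (IX) smoothing parameter `gamma` are assumptions for illustration; the exact NeuRD-CIX estimator from the paper is not reproduced here.

```python
# Hedged sketch: one-sample Monte Carlo logit updates for SPG and NeuRD.
# Assumptions: single-state bandit, reward r in [0, 1], softmax policy.
# `gamma` is the standard IX smoothing device; gamma = 0 recovers the plain
# importance-weighted estimate. The capping used by CIX is only gestured at.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mc_updates(z, a, r, gamma=0.0):
    """Return (spg_update, neurd_update) for sampled action a with reward r."""
    pi = softmax(z)
    # Importance-weighted action-value estimate: nonzero only at the sampled
    # action, scaled by 1 / (pi(a) + gamma).
    q_hat = np.zeros_like(pi)
    q_hat[a] = r / (pi[a] + gamma)
    v_hat = pi @ q_hat  # value (baseline) estimate

    # SPG: gradient of expected reward w.r.t. the logits. The pi(a') factor
    # cancels the 1/pi(a) importance weight at the sampled action, so every
    # entry stays bounded by |r|.
    spg = pi * (q_hat - v_hat)

    # NeuRD: hedge/replicator-style update applied directly to the advantage
    # estimate. The importance correction is NOT cancelled, so the sampled
    # action's entry scales like 1/pi(a) and the variance grows as pi(a) -> 0.
    neurd = q_hat - v_hat
    return spg, neurd

# Example: a rarely sampled action makes the difference visible.
z = np.array([2.0, 0.0, -2.0])
a, r = 2, 1.0                                # low-probability action sampled
spg, neurd = mc_updates(z, a, r)
print("pi          :", np.round(softmax(z), 3))
print("SPG update  :", np.round(spg, 3))     # bounded entries
print("NeuRD update:", np.round(neurd, 3))   # sampled entry ~ 1/pi(a)
```

Running the sketch with a low-probability sampled action shows the SPG update entries bounded by the reward while the NeuRD update's sampled entry is inflated by the uncancelled importance weight; setting `gamma > 0` damps that inflation, which is the flavor of tradeoff the CIX estimates and NeuRD-CIX are built around.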