In this paper, we propose and study opportunistic reinforcement learning, a new variant of reinforcement learning in which the regret of selecting a suboptimal action varies with an external environmental condition known as the variation factor. When the variation factor is low, so is the regret of selecting a suboptimal action, and vice versa. Our intuition is therefore to exploit more when the variation factor is high and explore more when it is low. We demonstrate the benefit of this framework for finite-horizon episodic MDPs by designing and evaluating the OppUCRL2 and OppPSRL algorithms, which dynamically balance the exploration-exploitation trade-off by introducing variation-factor-dependent optimism to guide exploration. We establish an $\tilde{O}(HS \sqrt{AT})$ regret bound for OppUCRL2 and show through simulations that both OppUCRL2 and OppPSRL outperform their original counterparts.
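To make the idea concrete, the following minimal Python sketch (not taken from the paper; the function name, the specific scaling rule, and all parameters are illustrative assumptions) shows one way variation-factor-dependent optimism could modulate a UCRL2-style confidence bonus, inflating it when the variation factor is low so that exploration happens when suboptimal actions are cheap.

```python
import numpy as np


def variation_scaled_bonus(n_visits, t, variation_factor, delta=0.05):
    """Hypothetical sketch of variation-factor-dependent optimism.

    n_visits: visit counts N(s, a), array of shape (S, A)
    t: current time step (>= 1)
    variation_factor: external condition in [0, 1]; low values mean the
        regret of a suboptimal action is low, so exploration is cheap
    delta: confidence parameter
    """
    # Standard Hoeffding-style confidence bonus used by UCRL2-type methods.
    base_bonus = np.sqrt(np.log(2 * t / delta) / np.maximum(n_visits, 1))
    # Opportunistic scaling (illustrative assumption, not the paper's exact
    # rule): boost optimism when the variation factor is low, shrink it
    # when the variation factor is high.
    scale = 1.0 + (1.0 - variation_factor)
    return scale * base_bonus
```

In a UCRL2-style loop, the scaled bonus would simply replace the usual bonus inside the optimistic planning step (e.g., extended value iteration); the effect is that episodes with a low variation factor receive wider confidence sets and hence more exploratory policies.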