Reinforcement learning (RL) algorithms aim to learn optimal decisions in unknown environments through the experience of taking actions and observing the rewards obtained. In some cases, the environment is not influenced by the actions of the RL agent, in which case the problem can be modeled as a contextual multi-armed bandit and lightweight myopic algorithms can be employed. On the other hand, when the RL agent's actions affect the environment, the problem must be modeled as a Markov decision process and more complex RL algorithms are required which take the future effects of actions into account. Moreover, in practice, it is often unknown from the outset whether or not the agent's actions will impact the environment, and it is therefore not possible to determine in advance which RL algorithm is most fitting. In this work, we propose to avoid this difficult decision entirely and incorporate a choice mechanism into our RL framework. Rather than assuming a specific problem structure, we use a probabilistic structure estimation procedure based on a likelihood-ratio (LR) test to make a more informed selection of learning algorithm. We derive a sufficient condition under which myopic policies are optimal, present an LR test for this condition, and derive a bound on the regret of our framework. We provide examples of real-world scenarios where our framework is needed and provide extensive simulations to validate our approach.
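As a rough illustration of the structure-estimation idea, and not the paper's exact procedure, the sketch below runs a generic likelihood-ratio test for whether the observed next-state distribution depends on the chosen action: the null model pools transitions over actions (a contextual-bandit structure), while the alternative fits a separate transition distribution per state-action pair (an MDP structure). The function name, its arguments, and the chi-squared threshold are assumptions made for this example only.

```python
import numpy as np
from scipy.stats import chi2


def lr_test_action_influence(transitions, num_states, num_actions, alpha=0.05):
    """Hypothetical sketch: test whether next-state distributions depend on
    the action. `transitions` is an iterable of (state, action, next_state)
    index tuples. Returns True if the action-independent (bandit) model is
    rejected in favour of the MDP model."""
    # Empirical counts N[s, a, s'] of observed transitions.
    counts = np.zeros((num_states, num_actions, num_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1

    eps = 1e-12  # guard against log(0) for unobserved transitions

    # Alternative (MDP): a separate transition distribution per (s, a) pair.
    p_mdp = counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)
    ll_mdp = np.sum(counts * np.log(p_mdp + eps))

    # Null (bandit): one transition distribution per state, shared across actions.
    counts_pooled = counts.sum(axis=1)
    p_bandit = counts_pooled / np.maximum(counts_pooled.sum(axis=1, keepdims=True), 1)
    ll_bandit = np.sum(counts_pooled * np.log(p_bandit + eps))

    # The LR statistic is asymptotically chi-squared under the null; the
    # degrees of freedom are the extra free parameters of the MDP model.
    stat = 2.0 * (ll_mdp - ll_bandit)
    dof = num_states * (num_actions - 1) * (num_states - 1)
    return stat > chi2.ppf(1.0 - alpha, dof)
```

In this sketch, the outcome of the test would determine whether a lightweight myopic (bandit) learner or a full RL algorithm is run on the subsequent data; the actual framework in the paper additionally accounts for the condition under which myopic policies are optimal and for the regret incurred by the selection step.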