Reinforcement learning (RL) algorithms aim to learn optimal decisions in unknown environments through the experience of taking actions and observing the rewards gained. When the environment is not influenced by the actions of the RL agent, the problem can be modeled as a contextual multi-armed bandit and lightweight \emph{myopic} algorithms can be employed. On the other hand, when the RL agent's actions do affect the environment, the problem must be modeled as a Markov decision process, and more complex RL algorithms that take the future effects of actions into account are required. Moreover, in many modern RL settings it is unknown from the outset whether the agent's actions will impact the environment, so it is often not possible to determine which RL algorithm is most appropriate. In this work, we propose to avoid this dilemma entirely by incorporating a choice mechanism into our RL framework. Rather than assuming a specific problem structure, we use a probabilistic structure estimation procedure based on a likelihood-ratio (LR) test to make a more informed selection of learning algorithm. We derive a sufficient condition under which myopic policies are optimal, present an LR test for this condition, and derive a bound on the regret of our framework. We provide examples of real-world scenarios where our framework is needed and present extensive simulations to validate our approach.
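As a rough illustration (the notation below is ours and is not defined in this abstract), a likelihood-ratio test of this kind contrasts the likelihood of the observed state transitions under the null hypothesis that the dynamics do not depend on the chosen actions with the likelihood under the unrestricted alternative, retaining the myopic bandit algorithm only while the null is not rejected:
\[
  \Lambda_t \;=\; \frac{\sup_{P \in \mathcal{P}_0}\, \prod_{k=1}^{t} P\!\left(s_{k+1} \mid s_k\right)}
                       {\sup_{P \in \mathcal{P}}\, \prod_{k=1}^{t} P\!\left(s_{k+1} \mid s_k, a_k\right)},
  \qquad \text{switch to a full RL algorithm if } -2\log \Lambda_t > \tau ,
\]
where $s_k$ and $a_k$ denote the observed states and actions, $\mathcal{P}_0 \subset \mathcal{P}$ is the family of action-independent transition models, and $\tau$ is a test threshold. This is only a generic sketch; the exact statistic, hypotheses, and threshold used in our framework are derived in the body of the paper.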