We consider model selection for classic Reinforcement Learning (RL) environments -- Multi-Armed Bandits (MABs) and Markov Decision Processes (MDPs) -- under general function approximation. In the model selection framework, we do not know the function classes, denoted by $\mathcal{F}$ and $\mathcal{M}$, in which the true models -- the reward-generating function for MABs and the transition kernel for MDPs -- lie, respectively. Instead, we are given $M$ nested function (hypothesis) classes such that the true models are contained in at least one such class. In this paper, we propose and analyze efficient model selection algorithms for MABs and MDPs that \emph{adapt} to the smallest function class (among the nested $M$ classes) containing the true underlying model. Under a separability assumption on the nested hypothesis classes, we show that the cumulative regret of our adaptive algorithms matches that of an oracle which knows the correct function classes (i.e., $\mathcal{F}$ and $\mathcal{M}$) a priori. Furthermore, for both settings, we show that the cost of model selection is an additive term in the regret with only weak (logarithmic) dependence on the learning horizon $T$.
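Schematically, and using illustrative notation not fixed by the abstract, the guarantee can be read as follows: writing $R_{\mathrm{adapt}}(T)$ for the cumulative regret of the adaptive algorithm and $R_{\mathrm{oracle}}(T)$ for that of an oracle told the smallest class containing the true model, the claimed behavior is of the form
\[
R_{\mathrm{adapt}}(T) \;\le\; R_{\mathrm{oracle}}(T) \;+\; \underbrace{C \log T}_{\text{cost of model selection}},
\]
where $C$ is a problem-dependent constant (e.g., depending on the separability of the nested classes and on $M$), so the price of not knowing the correct class is additive and only logarithmic in $T$.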