This paper studies sample-efficient Reinforcement Learning (RL) in settings where only the optimal value function is assumed to be linearly realizable. It has recently been understood that, even under this seemingly strong assumption and with access to a generative model, worst-case sample complexities can be prohibitively (i.e., exponentially) large. We investigate the setting where the learner additionally has access to interactive demonstrations from an expert policy, and we present a statistically and computationally efficient algorithm (Delphi) for blending exploration with expert queries. In particular, Delphi requires $\tilde{\mathcal{O}}(d)$ expert queries and a $\texttt{poly}(d,H,|\mathcal{A}|,1/\varepsilon)$ number of exploratory samples to provably recover an $\varepsilon$-suboptimal policy. Compared to pure RL approaches, this corresponds to an exponential improvement in sample complexity with surprisingly little expert input. Compared to prior imitation learning (IL) approaches, our required number of expert demonstrations is independent of $H$ and logarithmic in $1/\varepsilon$, whereas all prior work requires at least a linear dependence on both, in addition to the same dependence on $d$. Towards establishing the minimal number of expert queries needed, we show that, in the same setting, any learner whose exploration budget is polynomially bounded (in terms of $d,H,$ and $|\mathcal{A}|$) requires at least $\tilde\Omega(\sqrt{d})$ oracle calls to recover a policy competing with the expert's value function. Under the weaker assumption that the expert's policy is linear, we show that the lower bound increases to $\tilde\Omega(d)$.
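To make the exploration/expert-query interplay concrete, the following is a minimal, hypothetical sketch in the spirit of the abstract, not the paper's actual algorithm. All names (`features`, `generative_model`, `expert_query`) and the uncertainty threshold are illustrative assumptions: a linear estimate of $Q^\star$ is fit from cheap exploratory samples, and the costly expert oracle is invoked only at states where the estimate is uncertain, which is what keeps the number of expert queries small.

```python
# Hypothetical sketch: blend cheap generative-model exploration with sparing
# expert queries. None of these names or design choices come from the paper.
import numpy as np

rng = np.random.default_rng(0)
d, n_actions, eps = 4, 3, 0.1

theta_star = rng.normal(size=d)  # unknown true Q* weights (linear realizability)

def features(state, action):
    """Hypothetical feature map phi(s, a) in R^d, normalized to unit length."""
    v = np.sin(state * (action + 1) + np.arange(d))
    return v / np.linalg.norm(v)

def generative_model(state, action):
    """Noisy sample of Q*(s, a), standing in for a sampled rollout."""
    return features(state, action) @ theta_star + 0.01 * rng.normal()

def expert_query(state):
    """Expert oracle: returns the optimal action (costly; used sparingly)."""
    return max(range(n_actions), key=lambda a: features(state, a) @ theta_star)

# Exploration phase: cheap samples from the generative model, linear fit.
states = rng.normal(size=40)
X = np.array([features(s, a) for s in states for a in range(n_actions)])
y = np.array([generative_model(s, a) for s in states for a in range(n_actions)])
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
cov = X.T @ X + np.eye(d)  # regularized design matrix for uncertainty widths

n_queries = 0
def act(state):
    """Follow the fitted estimate; fall back to the expert only when uncertain."""
    global n_queries
    phis = [features(state, a) for a in range(n_actions)]
    widths = [np.sqrt(p @ np.linalg.solve(cov, p)) for p in phis]
    if max(widths) > eps:  # elliptical-confidence-style uncertainty test
        n_queries += 1
        return expert_query(state)
    return int(np.argmax([p @ theta_hat for p in phis]))

# Evaluation: how often the learned rule matches the expert, and at what cost.
test_states = rng.normal(size=20)
agreement = np.mean([act(s) == expert_query(s) for s in test_states])
print(f"agreement with expert: {agreement:.2f}, expert queries used: {n_queries}")
```

The design choice mirrored here is that the expert is consulted only where the exploratory data leaves the linear estimate uncertain, so query usage scales with the feature dimension rather than with the horizon or accuracy level.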