Reinforcement learning with function approximation has recently achieved tremendous results in applications with large state spaces. This empirical success has motivated a growing body of theoretical work proposing necessary and sufficient conditions under which efficient reinforcement learning is possible. From this line of work, a remarkably simple minimal sufficient condition has emerged for sample-efficient reinforcement learning: MDPs with optimal value functions $V^*$ and $Q^*$ linear in some known low-dimensional features. In this setting, recent works have designed sample-efficient algorithms which require a number of samples polynomial in the feature dimension and independent of the size of the state space. However, they leave finding computationally efficient algorithms as future work, and this is considered a major open problem in the community. In this work, we make progress on this open problem by presenting the first computational lower bound for RL with linear function approximation: unless NP=RP, no randomized polynomial-time algorithm exists for deterministic transition MDPs with a constant number of actions and linear optimal value functions. To prove this, we show a reduction from Unique-Sat, where we convert a CNF formula into an MDP with deterministic transitions, a constant number of actions, and low-dimensional linear optimal value functions. This result also exhibits the first computational-statistical gap in reinforcement learning with linear function approximation, as the underlying statistical problem is information-theoretically solvable with a polynomial number of queries, but no computationally efficient algorithm exists unless NP=RP. Finally, we also prove a quasi-polynomial time lower bound under the Randomized Exponential Time Hypothesis.