We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions. This framework has two phases: in the exploration phase, the agent collects trajectories by interacting with the environment without using any reward signal; in the planning phase, the agent needs to return a near-optimal policy for an arbitrary reward function. We give a new efficient algorithm, \textbf{S}taged \textbf{S}ampling + \textbf{T}runcated \textbf{P}lanning (\algoname), which interacts with the environment for at most $O\left( \frac{S^2A}{\epsilon^2}\text{poly}\log\left(\frac{SAH}{\epsilon}\right) \right)$ episodes in the exploration phase and is guaranteed to output a near-optimal policy for an arbitrary reward function in the planning phase. Here, $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the planning horizon, and $\epsilon$ is the target accuracy relative to the total reward. Notably, our sample complexity scales only \emph{logarithmically} with $H$, in contrast to all existing results, which scale \emph{polynomially} with $H$. Furthermore, this bound matches the minimax lower bound $\Omega\left(\frac{S^2A}{\epsilon^2}\right)$ up to logarithmic factors. Our results rely on three new techniques: 1) a new sufficient condition on the dataset for planning an $\epsilon$-suboptimal policy; 2) a new way to plan efficiently under the proposed condition using soft-truncated planning; 3) constructing an extended MDP to maximize the truncated cumulative reward efficiently.
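Concretely, the guarantee can be stated as follows; the notation $V_1^{\pi}(r)$ for the value of policy $\pi$ under reward function $r$, and $\widehat{\pi}_r$ for the returned policy, is a standard convention assumed here rather than fixed by the abstract. After the exploration phase collects
\[
  N = O\!\left(\frac{S^2A}{\epsilon^2}\,\mathrm{poly}\log\!\left(\frac{SAH}{\epsilon}\right)\right)
\]
reward-free episodes, then with high probability, for every reward function $r$ supplied in the planning phase the returned policy $\widehat{\pi}_r$ satisfies
\[
  V_1^{*}(r) - V_1^{\widehat{\pi}_r}(r) \le \epsilon .
\]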