While much progress has been made in understanding the minimax sample complexity of reinforcement learning (RL) -- the complexity of learning on the "worst-case" instance -- such measures of complexity often do not capture the true difficulty of learning. In practice, on an "easy" instance, we might hope to achieve a complexity far better than that achievable on the worst-case instance. In this work we seek to understand the "instance-dependent" complexity of learning near-optimal policies (PAC RL) in the setting of RL with linear function approximation. We propose an algorithm, \textsc{Pedel}, which achieves a fine-grained instance-dependent measure of complexity, the first of its kind in the RL with function approximation setting, thereby capturing the difficulty of learning on each particular problem instance. Through an explicit example, we show that \textsc{Pedel} yields provable gains over low-regret, minimax-optimal algorithms and that such algorithms are unable to hit the instance-optimal rate. Our approach relies on a novel online experiment design-based procedure which focuses the exploration budget on the "directions" most relevant to learning a near-optimal policy, and may be of independent interest.
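As a rough illustration of the experiment-design viewpoint, in the spirit of optimal experiment design for linear models (the notation below is introduced here for exposition and is not taken verbatim from the paper), suppose each state-action pair $x \in \mathcal{X}$ has a feature vector $\phi_x \in \mathbb{R}^d$ and each candidate policy $\pi \in \Pi$ induces an expected feature direction $\phi_\pi$. A design step of this flavor chooses a sampling distribution $\lambda$ over the simplex $\triangle(\mathcal{X})$ so as to minimize the worst-case uncertainty in the policy directions,
\[
\inf_{\lambda \in \triangle(\mathcal{X})} \; \max_{\pi \in \Pi} \; \|\phi_\pi\|_{\Lambda(\lambda)^{-1}}^2,
\qquad \Lambda(\lambda) := \sum_{x \in \mathcal{X}} \lambda_x \, \phi_x \phi_x^\top ,
\]
so that the sampling budget concentrates on the directions needed to distinguish the values of near-optimal policies, rather than being spread uniformly over the state space.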