Reinforcement learning is hard in general. Yet, in many specific environments, learning is easy. What makes learning easy in one environment, but difficult in another? We address this question by proposing a simple measure of reinforcement-learning hardness called the bad-policy density. This quantity measures the fraction of the deterministic stationary policy space that is below a desired threshold in value. We prove that this simple quantity has many properties one would expect of a measure of learning hardness. Further, we prove it is NP-hard to compute the measure in general, but there are paths to polynomial-time approximation. We conclude by summarizing potential directions and uses for this measure.
翻译:强化学习一般是困难的。然而,在许多特定的环境中,学习是容易的。是什么使得在一个环境中学习容易,而在另一个环境中则很困难?我们通过提出一个称为坏政策密度的简单强化学习硬性衡量标准来解决这个问题。这个数量用来衡量低于理想值阈值的决定性固定政策空间的一小部分。我们证明,这一简单数量具有许多特性,人们可以期望某种程度的学习硬性。此外,我们证明,一般地计算衡量尺度是很难的,但是有通往多时近似的途径。我们最后总结了这一措施的潜在方向和用途。