An appropriate reward function is of paramount importance in specifying a task in reinforcement learning (RL). Yet, even for simple tasks, designing a correct reward function is known to be extremely challenging in practice. Human-in-the-loop (HiL) RL allows humans to communicate complex goals to the RL agent by providing various types of feedback. However, despite its great empirical successes, HiL RL usually requires too much feedback from the human teacher and also suffers from insufficient theoretical understanding. In this paper, we address this issue from a theoretical perspective, aiming to provide provably feedback-efficient algorithmic frameworks that place a human in the loop to specify the rewards of given tasks. We provide an active-learning-based RL algorithm that first explores the environment without a specified reward function and then asks a human teacher only a few queries about the rewards of the task at selected state-action pairs. After that, the algorithm is guaranteed to return a nearly optimal policy for the task with high probability. We show that, even in the presence of random noise in the feedback, the algorithm makes only $\widetilde{O}(H \dim_{R}^2)$ queries to the reward function to provide an $\epsilon$-optimal policy for any $\epsilon > 0$. Here $H$ is the horizon of the RL environment, and $\dim_{R}$ measures the complexity of the function class representing the reward function. In contrast, standard RL algorithms need to query the reward function at $\Omega(\operatorname{poly}(d, 1/\epsilon))$ state-action pairs, where $d$ depends on the complexity of the environment's transition dynamics.
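To make the two-phase structure concrete, the following is a minimal sketch on a toy tabular MDP: reward-free exploration to estimate transitions, a handful of noisy reward queries to a (simulated) human teacher, and planning on the result. The environment, the uniform-random exploration, the "most-visited" query-selection rule, the noise level, and all names below are illustrative assumptions, not the algorithm analyzed in the paper.

```python
# Minimal sketch, assuming a small tabular MDP; everything here is illustrative,
# not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 5, 3, 10                                   # states, actions, horizon

# Toy environment: transitions P and reward r_true are hidden from the learner;
# the reward is only revealed through noisy human queries.
P = rng.dirichlet(np.ones(S), size=(S, A))           # P[s, a] is a distribution over next states
r_true = rng.uniform(size=(S, A))

# Phase 1: reward-free exploration -- estimate the transition model without
# ever observing rewards (here via uniform-random rollouts).
counts = np.zeros((S, A, S))
for _ in range(2000):
    s = rng.integers(S)
    for _ in range(H):
        a = rng.integers(A)
        s_next = rng.choice(S, p=P[s, a])
        counts[s, a, s_next] += 1
        s = s_next
P_hat = (counts + 1e-6) / (counts + 1e-6).sum(axis=-1, keepdims=True)

# Phase 2: query the human teacher for the reward at only a few state-action
# pairs (here, the most frequently visited ones), with noisy responses.
visits = counts.sum(axis=-1)
n_queries = 8
top_pairs = np.argsort(visits.ravel())[::-1][:n_queries]
r_hat = np.zeros((S, A))
for idx in top_pairs:
    s, a = np.unravel_index(idx, (S, A))
    r_hat[s, a] = r_true[s, a] + 0.01 * rng.normal()  # noisy human feedback

# Phase 3: plan with finite-horizon value iteration on the estimated model and
# the queried rewards, and return the resulting policy.
V = np.zeros(S)
policy = np.zeros((H, S), dtype=int)
for h in reversed(range(H)):
    Q = r_hat + P_hat @ V                             # Q[s, a] at step h
    policy[h] = Q.argmax(axis=1)
    V = Q.max(axis=1)
print("greedy policy at step 0:", policy[0])
```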