We study model-based reinforcement learning (RL) for episodic Markov decision processes (MDPs) whose transition probability is parametrized by an unknown transition core together with features of the state and action. Despite much recent progress in analyzing algorithms in the linear MDP setting, the understanding of more general transition models remains limited. In this paper, we establish a provably efficient RL algorithm for MDPs whose state transitions are given by a multinomial logistic model. To balance the exploration-exploitation trade-off, we propose an upper confidence bound-based algorithm. We show that our proposed algorithm achieves an $\tilde{\mathcal{O}}(d \sqrt{H^3 T})$ regret bound, where $d$ is the dimension of the transition core, $H$ is the horizon, and $T$ is the total number of steps. To the best of our knowledge, this is the first model-based RL algorithm with multinomial logistic function approximation that has provable guarantees. We also comprehensively evaluate our proposed algorithm numerically and show that it consistently outperforms existing methods, achieving both provable efficiency and superior practical performance.
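For concreteness, a standard multinomial logistic parametrization of such a transition model takes the following form; the exact feature map $\varphi$ and the normalization set $\mathcal{S}_{s,a}$ are assumptions for illustration and are not specified in the abstract:
$$
\mathbb{P}(s' \mid s, a) \;=\; \frac{\exp\!\big(\varphi(s,a,s')^{\top} \theta^{*}\big)}{\sum_{\tilde{s} \in \mathcal{S}_{s,a}} \exp\!\big(\varphi(s,a,\tilde{s})^{\top} \theta^{*}\big)},
$$
where $\varphi(s,a,s') \in \mathbb{R}^d$ is a known feature map, $\theta^{*} \in \mathbb{R}^d$ is the unknown transition core, and $\mathcal{S}_{s,a}$ denotes the set of candidate next states from $(s,a)$.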