We consider the regret minimization problem in reinforcement learning (RL) in the episodic setting. In many real-world RL environments, the state and action spaces are continuous or very large. Existing approaches establish regret guarantees either through a low-dimensional representation of the stochastic transition model or through an approximation of the $Q$-functions. However, the understanding of function approximation schemes for state-value functions remains largely missing. In this paper, we propose an online model-based RL algorithm, namely CME-RL, that learns representations of transition distributions as embeddings in a reproducing kernel Hilbert space while carefully balancing the exploitation-exploration tradeoff. We demonstrate the efficiency of our algorithm by proving a frequentist (worst-case) regret bound of order $\tilde{O}\big(H\gamma_N\sqrt{N}\big)$\footnote{$\tilde{O}(\cdot)$ hides only absolute constants and poly-logarithmic factors.}, where $H$ is the episode length, $N$ is the total number of time steps, and $\gamma_N$ is an information-theoretic quantity that captures the effective dimension of the state-action feature space. Our method bypasses the need for estimating transition probabilities and applies to any domain on which kernels can be defined. It also brings new insights into the general theory of kernel methods for approximate inference and RL regret minimization.
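To fix ideas, a standard estimator from the conditional mean embedding literature is sketched below; the abstract does not spell this out, and the notation ($\lambda$, feature maps $\psi$, sample size $n$) is illustrative. Given transition samples $\{(s_i,a_i,s'_i)\}_{i=1}^{n}$, a kernel $k$ on state-action pairs, and a kernel on states with feature map $\psi$, the embedding of $P(\cdot \mid s,a)$ can be estimated by kernel ridge regression:
\[
\hat{\mu}_{s,a} \;=\; \sum_{i=1}^{n} \alpha_i(s,a)\, \psi(s'_i),
\qquad
\alpha(s,a) \;=\; \big(K + n\lambda I\big)^{-1} k_n(s,a),
\]
where $K = \big[k\big((s_i,a_i),(s_j,a_j)\big)\big]_{i,j=1}^{n}$ is the Gram matrix, $k_n(s,a) = \big[k\big((s_i,a_i),(s,a)\big)\big]_{i=1}^{n}$, and $\lambda > 0$ is a regularization parameter. Expectations of a value function $V$ in the state RKHS are then approximated without ever estimating transition probabilities, via $\mathbb{E}_{s' \sim P(\cdot\mid s,a)}[V(s')] \approx \sum_{i=1}^{n} \alpha_i(s,a)\, V(s'_i)$.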