We propose \emph{Choquet regularizers} to measure and manage the level of exploration in reinforcement learning (RL), and reformulate the continuous-time entropy-regularized RL problem of Wang et al. (2020, JMLR, 21(198)) by replacing the differential entropy used for regularization with a Choquet regularizer. We derive the Hamilton--Jacobi--Bellman equation of the problem and solve it explicitly in the linear--quadratic (LQ) case by statically maximizing a Choquet regularizer subject to a mean--variance constraint. Under the LQ setting, we derive explicit optimal distributions for several specific Choquet regularizers and, conversely, identify the Choquet regularizers that generate a number of broadly used exploratory samplers such as $\epsilon$-greedy, exponential, uniform, and Gaussian.
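As a minimal sketch of the replacement described above, assuming the standard exploratory objective of Wang et al. (2020) with an infinite horizon and discount rate $\rho$ (the symbols $J$, $r$, $\lambda$, and the exploratory state process $X^{\pi}$ are illustrative, not taken verbatim from the source), the regularized value to be maximized over randomized policies $\pi$ would take the form
\begin{equation*}
J(x;\pi) \;=\; \mathbb{E}\!\left[\int_0^{\infty} e^{-\rho t}\Big( r\big(X^{\pi}_t,\pi_t\big) \;+\; \lambda\,\Phi(\pi_t) \Big)\,dt \;\Big|\; X^{\pi}_0 = x \right],
\end{equation*}
where $\Phi$ is a Choquet regularizer and the term $\lambda\,\Phi(\pi_t)$ stands in place of the differential entropy $-\lambda\int \pi_t(u)\ln \pi_t(u)\,du$ used in the original entropy-regularized formulation.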