We consider the problem of learning in a non-stationary reinforcement learning (RL) environment, where the setting can be fully described by a piecewise stationary discrete-time Markov decision process (MDP). We introduce a variant of the Restarted Bayesian Online Change-Point Detection algorithm (R-BOCPD) that operates on input streams originating from the more general multinomial distribution and provides near-optimal theoretical guarantees in terms of false-alarm rate and detection delay. Based on this, we propose an improved version of the UCRL2 algorithm for MDPs with state transition kernels sampled from a multinomial distribution, which we call R-BOCPD-UCRL2. We perform a finite-time performance analysis and show that R-BOCPD-UCRL2 enjoys a favorable regret bound of $O\left(D O \sqrt{A T K_T \log\left(\frac{T}{\delta}\right) + \frac{K_T \log \frac{K_T}{\delta}}{\min\limits_\ell \: \mathbf{KL}\left( {\mathbf{\theta}^{(\ell+1)}}\mid\mid{\mathbf{\theta}^{(\ell)}}\right)}}\right)$, where $D$ is the largest MDP diameter over the set of MDPs defining the piecewise stationary setting, $O$ is the finite number of states (constant across all changes), $A$ is the finite number of actions (constant across all changes), $K_T$ is the number of change points up to horizon $T$, and $\mathbf{\theta}^{(\ell)}$ is the transition kernel during the interval $[c_\ell, c_{\ell+1})$, which we assume to be multinomially distributed over the set of states $\mathbb{O}$. Interestingly, the performance bound does not directly scale with the variation in the MDP's state transition distributions and rewards, i.e., the setting can also model abrupt changes. In practice, R-BOCPD-UCRL2 outperforms the state of the art in a variety of scenarios in synthetic environments. We provide a detailed experimental setup along with a code repository (to be released upon publication) that can be used to easily reproduce our experiments.
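To make the detection-and-restart structure concrete, below is a minimal, hypothetical Python sketch of a change-point detector on a multinomial (categorical) stream, such as the observed next-state stream for a fixed state-action pair. It is not the paper's R-BOCPD: instead of the restarted Bayesian run-length procedure it uses a simplified windowed KL-ratio test, and all names and parameters (`MultinomialChangeDetector`, `window`, `threshold`) are illustrative assumptions.

```python
import numpy as np


class MultinomialChangeDetector:
    """Simplified windowed KL test on a categorical (multinomial) stream.

    Illustrative stand-in for a change-point detector such as R-BOCPD: it
    compares the empirical distribution of the most recent `window` samples
    against the older samples and flags a change when the scaled KL
    divergence exceeds `threshold`.
    """

    def __init__(self, n_categories, window=50, threshold=3.0):
        self.n = n_categories
        self.window = window
        self.threshold = threshold
        self.samples = []

    def update(self, x):
        """Feed one observation (integer in [0, n_categories)); return True on a detected change."""
        self.samples.append(x)
        if len(self.samples) < 2 * self.window:
            return False
        # Smoothed empirical counts for the older and the most recent window.
        old = np.bincount(self.samples[:-self.window], minlength=self.n) + 1.0
        new = np.bincount(self.samples[-self.window:], minlength=self.n) + 1.0
        p, q = old / old.sum(), new / new.sum()
        kl = float(np.sum(q * np.log(q / p)))  # KL(recent || past)
        if self.window * kl > self.threshold:
            self.samples = self.samples[-self.window:]  # restart the statistics
            return True
        return False


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    det = MultinomialChangeDetector(n_categories=3, window=50, threshold=3.0)
    # Stationary segment, then an abrupt change in the multinomial parameters.
    stream = list(rng.choice(3, size=300, p=[0.7, 0.2, 0.1])) + \
             list(rng.choice(3, size=300, p=[0.1, 0.2, 0.7]))
    for t, x in enumerate(stream):
        if det.update(int(x)):
            print(f"change flagged at t={t}")
```

At a high level, a flagged change in R-BOCPD-UCRL2 would correspond to restarting the learner's statistics (visit counts and confidence sets) so that the optimistic planning of UCRL2 adapts to the new transition kernel $\mathbf{\theta}^{(\ell+1)}$; the sketch above only illustrates the detection side of that loop.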