We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a suitable schedule of $\gamma$, we establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy.
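The resampling rule described above admits a simple tabular instantiation. The following sketch is our own illustration rather than code from the paper: it assumes a Dirichlet-categorical posterior over transitions, Beta-Bernoulli posteriors over rewards, and value iteration as the planner in the sampled model; the class and method names are hypothetical.

```python
# A minimal tabular sketch of continuing posterior sampling with probabilistic
# resampling, under the assumptions stated in the lead-in (illustrative only).
import numpy as np

class ContinuingPSRL:
    def __init__(self, S, A, gamma=0.99, seed=0):
        self.S, self.A, self.gamma = S, A, gamma
        self.rng = np.random.default_rng(seed)
        self.trans_counts = np.ones((S, A, S))   # Dirichlet(1) prior over P(s'|s,a)
        self.rew_counts = np.ones((S, A, 2))     # Beta(1,1) prior over Bernoulli rewards
        self.policy = self._resample_and_plan()

    def _resample_and_plan(self):
        # Draw an environment from the posterior, then compute a policy that
        # maximizes gamma-discounted return in the sampled model via value iteration.
        P = np.array([[self.rng.dirichlet(self.trans_counts[s, a])
                       for a in range(self.A)] for s in range(self.S)])
        R = self.rng.beta(self.rew_counts[..., 0], self.rew_counts[..., 1])
        V = np.zeros(self.S)
        for _ in range(1000):
            Q = R + self.gamma * P @ V           # (S, A) action values in sampled model
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < 1e-6:
                V = V_new
                break
            V = V_new
        return Q.argmax(axis=1)

    def act(self, s):
        # With probability 1 - gamma, replace the sampled model (and its policy)
        # by a fresh draw from the posterior.
        if self.rng.random() < 1.0 - self.gamma:
            self.policy = self._resample_and_plan()
        return self.policy[s]

    def observe(self, s, a, r, s_next):
        # Update posterior counts with the observed transition and binary reward.
        self.trans_counts[s, a, s_next] += 1
        self.rew_counts[s, a, 0] += r
        self.rew_counts[s, a, 1] += 1 - r
```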