We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration.
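As an illustration of the algorithm described above, here is a minimal sketch of continuing PSRL for a tabular environment. The choice of a Dirichlet posterior over transition probabilities, a Gaussian posterior over mean rewards, a value-iteration planner, and the `env.reset()`/`env.step()` interface are all illustrative assumptions, not specifications from the paper; only the resampling rule, which redraws the model with probability $1-\gamma$ at each time step, follows the description above. Note that under this rule the time between resamples is geometric with mean $1/(1-\gamma)$.

```python
# A minimal sketch of continuing PSRL, assuming a Dirichlet posterior over
# transitions and a Gaussian posterior over mean rewards (illustrative choices).
import numpy as np

def discounted_value_iteration(P, R, gamma, tol=1e-8):
    """Compute a gamma-discounted optimal policy for the sampled MDP (P, R)."""
    S, A = R.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * P @ V            # P: (S, A, S), V: (S,) -> Q: (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1)      # greedy policy, one action per state
        V = V_new

def continuing_psrl(env, S, A, gamma, T, rng=np.random.default_rng(0)):
    # Posterior statistics: Dirichlet counts for transitions, Gaussian for rewards.
    trans_counts = np.ones((S, A, S))    # Dirichlet(1, ..., 1) prior
    reward_sum = np.zeros((S, A))
    reward_n = np.zeros((S, A))
    policy = None
    state = env.reset()                  # hypothetical tabular-environment interface
    for t in range(T):
        # With probability 1 - gamma (and at t = 0), resample a model from the
        # posterior and recompute the gamma-discounted optimal policy for it.
        if policy is None or rng.random() < 1.0 - gamma:
            P = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(A)]
                          for s in range(S)])
            R = rng.normal(reward_sum / (reward_n + 1.0),
                           1.0 / np.sqrt(reward_n + 1.0))
            policy = discounted_value_iteration(P, R, gamma)
        action = policy[state]
        next_state, reward = env.step(action)
        # Update posterior statistics with the observed transition and reward.
        trans_counts[state, action, next_state] += 1.0
        reward_sum[state, action] += reward
        reward_n[state, action] += 1.0
        state = next_state
```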