At the boundary between the known and the unknown, an agent inevitably confronts the dilemma of whether to explore or to exploit. Epistemic uncertainty reflects this boundary, representing the systematic uncertainty that arises from limited knowledge. In this paper, we propose a Bayesian reinforcement learning (RL) algorithm, $\texttt{EUBRL}$, which leverages epistemic guidance to achieve principled exploration. This guidance adaptively reduces the per-step regret arising from estimation errors. We establish nearly minimax-optimal regret and sample complexity guarantees for a class of sufficiently expressive priors in infinite-horizon discounted Markov decision processes (MDPs). Empirically, we evaluate $\texttt{EUBRL}$ on tasks characterized by sparse rewards, long horizons, and stochasticity. Results demonstrate that $\texttt{EUBRL}$ achieves superior sample efficiency, scalability, and consistency.