The multi-armed bandit (MAB) problem is a simple yet powerful framework that has been extensively studied in the context of decision-making under uncertainty. In many real-world applications, such as robotics, selecting an arm corresponds to a physical action that constrains the choices of the next available arms (actions). Motivated by this, we study an extension of MAB called the graph bandit, where an agent travels over a graph to maximize the reward collected from different nodes. The graph defines the agent's freedom in selecting the next available nodes at each step. We assume the graph structure is fully available, but the reward distributions are unknown. Building on an offline graph-based planning algorithm and the principle of optimism, we design a learning algorithm, \texttt{G-UCB}, that balances long-term exploration and exploitation. We show that our proposed algorithm achieves $O(\sqrt{|S|T\log T}+D|S|\log T)$ learning regret, where $|S|$ is the number of nodes and $D$ is the diameter of the graph, which matches the theoretical lower bound $\Omega(\sqrt{|S|T})$ up to logarithmic factors. To our knowledge, this result is among the first tight regret bounds in non-episodic, undiscounted learning problems with known deterministic transitions. Numerical experiments confirm that our algorithm outperforms several benchmarks.
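To make the optimism-plus-planning idea concrete, the following is a minimal, hypothetical Python sketch: the agent maintains a UCB index for every node, plans a shortest path on the known graph toward the node with the highest index, and updates its empirical estimates from the rewards collected along the way. The helper names (\texttt{shortest\_path}, \texttt{ucb\_index}, \texttt{run\_graph\_ucb}) and the toy graph are assumptions for illustration only, not the paper's exact \texttt{G-UCB} pseudocode.

\begin{verbatim}
import math
import random
from collections import deque

def shortest_path(adj, src, dst):
    """BFS shortest path on the known, unweighted graph (deterministic transitions)."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return list(reversed(path))  # src ... dst

def ucb_index(mean, count, t):
    """Optimistic value of a node: empirical mean plus a confidence radius."""
    if count == 0:
        return float("inf")
    return mean + math.sqrt(2.0 * math.log(t) / count)

def run_graph_ucb(adj, reward_fn, start, horizon):
    means = {s: 0.0 for s in adj}
    counts = {s: 0 for s in adj}
    pos, t, total = start, 1, 0.0
    while t <= horizon:
        # Plan toward the node with the highest UCB index, then traverse the
        # path, recording the reward observed at every node visited en route.
        target = max(adj, key=lambda s: ucb_index(means[s], counts[s], t))
        for node in shortest_path(adj, pos, target)[1:] or [pos]:
            r = reward_fn(node)
            counts[node] += 1
            means[node] += (r - means[node]) / counts[node]
            total += r
            pos, t = node, t + 1
            if t > horizon:
                break
    return total

# Toy usage: a 4-node path graph with Bernoulli rewards of unknown means.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
true_means = {0: 0.2, 1: 0.5, 2: 0.3, 3: 0.9}
reward = lambda s: 1.0 if random.random() < true_means[s] else 0.0
print(run_graph_ucb(adj, reward, start=0, horizon=2000))
\end{verbatim}

The sketch only illustrates how optimism interacts with the graph constraint: the agent cannot jump to an arbitrary arm, so exploration is paid for by the travel cost along the planned path, which is where the diameter $D$ enters the regret bound.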