The multi-armed bandit (MAB) problem is a simple yet powerful framework that has been extensively studied in the context of decision-making under uncertainty. In many real-world applications, such as robotics, selecting an arm corresponds to a physical action that constrains the choices of the next available arms (actions). Motivated by this, we study an extension of MAB called the graph bandit, where an agent travels over a graph to maximize the reward collected from different nodes. The graph defines the agent's freedom in selecting the next available nodes at each step. We assume the graph structure is fully known, but the reward distributions are unknown. Building on an offline graph-based planning algorithm and the principle of optimism, we design a learning algorithm, G-UCB, that balances long-term exploration and exploitation. We show that our proposed algorithm achieves $O(\sqrt{|S|T\log(T)}+D|S|\log T)$ learning regret, where $|S|$ is the number of nodes and $D$ is the diameter of the graph, which matches the theoretical lower bound $\Omega(\sqrt{|S|T})$ up to logarithmic factors. To our knowledge, this result is among the first tight regret bounds for non-episodic, undiscounted learning problems with known deterministic transitions. Numerical experiments confirm that our algorithm outperforms several benchmarks.
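To make the setting concrete, below is a minimal sketch of an optimism-based graph bandit learner in the spirit described above: the agent maintains UCB indices over nodes, plans a shortest path on the known graph toward the most optimistic node, and updates its reward estimates along the way. The function names (`g_ucb_sketch`, `shortest_path`), the exploration constant `c`, and the per-step update scheme are illustrative assumptions, not the paper's exact G-UCB algorithm.

```python
# Illustrative sketch only; the actual G-UCB algorithm in the paper may
# differ in its planning routine, confidence radius, and episode structure.
import math
import random
from collections import deque

def shortest_path(adj, src, dst):
    """BFS shortest path on an unweighted graph given as {node: [neighbors]}."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    path, u = [], dst
    while u is not None:
        path.append(u)
        u = parent[u]
    return list(reversed(path))

def g_ucb_sketch(adj, pull, start, horizon, c=2.0):
    """Move toward the node with the highest UCB index, collecting rewards
    from every node visited along the way (an assumed simplification)."""
    counts = {s: 0 for s in adj}
    means = {s: 0.0 for s in adj}
    node, t, total = start, 0, 0.0
    while t < horizon:
        # Optimistic index: unvisited nodes get an infinite bonus.
        def index(s):
            if counts[s] == 0:
                return float("inf")
            return means[s] + math.sqrt(c * math.log(t + 1) / counts[s])
        target = max(adj, key=index)
        steps = shortest_path(adj, node, target)[1:] or [node]
        for s in steps:
            r = pull(s)                        # sample a reward at node s
            counts[s] += 1
            means[s] += (r - means[s]) / counts[s]
            total += r
            t += 1
            node = s
            if t >= horizon:
                break
    return total

# Toy usage: a 4-node path graph with Bernoulli rewards.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
probs = [0.1, 0.3, 0.2, 0.9]
reward = g_ucb_sketch(adj, lambda s: float(random.random() < probs[s]), 0, 500)
print("collected reward:", reward)
```

In this toy example the learner must traverse intermediate low-reward nodes to reach the best node, which is exactly the long-term exploration-exploitation trade-off the graph bandit formulation captures.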