We study decentralized stochastic linear bandits, where a network of $N$ agents cooperates to efficiently solve a linear bandit-optimization problem over a $d$-dimensional space. For this problem, we propose DLUCB: a fully decentralized algorithm that minimizes the cumulative regret over the entire network. In each round of the algorithm, each agent chooses its action following an upper confidence bound (UCB) strategy, and agents share information with their immediate neighbors through a carefully designed consensus procedure that repeats over cycles. Our analysis adjusts the duration of these communication cycles to ensure near-optimal regret performance $\mathcal{O}(d\log(NT)\sqrt{NT})$ at a communication rate of $\mathcal{O}(dN^2)$ per round. The structure of the network affects the regret performance via a small additive term, termed the regret of delay, that depends on the spectral gap of the underlying graph. Notably, our results apply to arbitrary network topologies and do not require a dedicated agent acting as a server. For settings where communication is costly, we propose RC-DLUCB: a modification of DLUCB in which agents communicate only rarely. The new algorithm trades off regret performance for a significantly reduced total communication cost of $\mathcal{O}(d^3N^{2.5})$ over all $T$ rounds. Finally, we show that our ideas extend naturally to the emerging, albeit more challenging, setting of safe bandits. For the recently studied problem of linear bandits with unknown linear safety constraints, we propose the first safe decentralized algorithm. Our study contributes towards applying bandit techniques in safety-critical distributed systems that repeatedly deal with unknown stochastic environments. We present numerical simulations for various network topologies that corroborate our theoretical findings.
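To make the interplay between neighbor-only communication, the spectral gap, and UCB action selection concrete, the following is a minimal illustrative sketch and not the DLUCB procedure itself. It assumes a ring topology, plain repeated averaging with a doubly stochastic matrix (a simple stand-in for the consensus procedure, whose details are not given here), and a generic ridge-regression UCB rule. All function names, the regularizer `lam`, and the confidence width `beta` are hypothetical choices made for illustration.

```python
import numpy as np

def ring_gossip_matrix(n):
    """Doubly stochastic mixing matrix for a ring (assumed topology):
    each agent averages its own value with those of its two neighbors."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] += 1.0 / 3.0
        W[i, (i - 1) % n] += 1.0 / 3.0
        W[i, (i + 1) % n] += 1.0 / 3.0
    return W

def consensus_average(W, local_values, num_steps):
    """Run num_steps of x <- W x along the agent axis (axis 0).
    Each step only combines values held by immediate neighbors."""
    x = np.asarray(local_values, dtype=float)
    for _ in range(num_steps):
        x = np.tensordot(W, x, axes=(1, 0))
    return x

def spectral_gap(W):
    """One minus the second-largest eigenvalue magnitude of a symmetric W.
    A larger gap means plain averaging mixes in fewer steps."""
    magnitudes = np.sort(np.abs(np.linalg.eigvalsh(W)))
    return 1.0 - magnitudes[-2]

def ucb_action(actions, avg_gram, avg_b, n_agents, lam=1.0, beta=1.0):
    """Generic linear-UCB choice from (approximately) network-averaged statistics.
    avg_gram ~ mean over agents of sum x x^T, avg_b ~ mean of sum reward * x.
    lam and beta are illustrative constants, not values from the paper."""
    d = actions.shape[1]
    V = lam * np.eye(d) + n_agents * avg_gram      # rescale average to a network sum
    theta_hat = np.linalg.solve(V, n_agents * avg_b)
    V_inv = np.linalg.inv(V)
    widths = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv, actions))
    return int(np.argmax(actions @ theta_hat + beta * widths))
```

In this sketch, each agent would keep its local Gram matrix and reward-weighted action sum, mix them with `consensus_average` for a fixed number of steps, and then call `ucb_action` on its mixed copy. The number of mixing steps needed for a given accuracy grows as `spectral_gap(W)` shrinks, which gives some intuition for why a graph-dependent delay term can appear in the regret.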