We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward. Although many algorithms have been proposed for the contextual bandit problem, most of them rely on finding the maximum likelihood estimator at each iteration, which requires $O(t)$ time at the $t$-th iteration and is memory inefficient. A natural way to resolve this problem is to apply online stochastic gradient descent (SGD), so that the per-step time and memory complexity can be reduced to constant with respect to $t$; however, a contextual bandit policy based on online SGD updates that balances exploration and exploitation has remained elusive. In this work, we show that online SGD can be applied to the generalized linear bandit problem. The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information and uses Thompson Sampling for exploration, achieves $\tilde{O}(\sqrt{T})$ regret with total time complexity that scales linearly in $T$ and $d$, where $T$ is the total number of rounds and $d$ is the number of features. Experimental results show that SGD-TS consistently outperforms existing algorithms on both synthetic and real datasets.
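To make the high-level idea concrete, below is a minimal sketch of a logistic (generalized linear) bandit loop that combines a single-step SGD update with Thompson-Sampling-style exploration. This is an illustration under assumed design choices, not the exact SGD-TS algorithm from the paper: the helper callables `get_contexts` and `get_reward`, the step size `eta`, the exploration scale `alpha`, and the Gaussian perturbation shrinking like $1/\sqrt{t}$ are all hypothetical choices made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_ts_sketch(get_contexts, get_reward, T, d, eta=0.1, alpha=1.0, rng=None):
    """Illustrative single-step-SGD + Thompson Sampling loop for a logistic bandit.

    get_contexts(t) -> array of shape (K, d): feature vectors of the K arms at round t.
    get_reward(t, x) -> observed reward in [0, 1] for the chosen arm's features x.
    eta (SGD step size) and alpha (exploration scale) are assumed hyperparameters.
    """
    rng = rng or np.random.default_rng(0)
    theta = np.zeros(d)                      # running SGD estimate of the GLM parameter
    for t in range(1, T + 1):
        X = get_contexts(t)                  # (K, d) arm features for this round
        # Thompson-style exploration: sample a perturbed parameter around theta,
        # with a perturbation that shrinks roughly like 1/sqrt(t).
        theta_tilde = theta + (alpha / np.sqrt(t)) * rng.standard_normal(d)
        scores = sigmoid(X @ theta_tilde)    # estimated mean rewards under the sample
        a = int(np.argmax(scores))           # play the best arm under the sampled model
        x = X[a]
        r = get_reward(t, x)                 # observe the reward for the chosen arm
        # Single SGD step on the logistic log-likelihood of the new pair (x, r):
        grad = (sigmoid(x @ theta) - r) * x
        theta -= eta * grad                  # O(d) time and memory per round
    return theta
```

The point of the sketch is the per-round cost: each iteration touches only the current round's data, so time and memory stay constant in $t$, in contrast to recomputing a maximum likelihood estimator over the full history.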