In this paper, we study the problem of bandits with knapsacks (BwK) in a non-stationary environment. The BwK problem generalizes the multi-armed bandit (MAB) problem to model the resource consumption associated with playing each arm. At each time, the decision maker/player chooses to play an arm, receives a reward, and consumes a certain amount of each of multiple resource types. The objective is to maximize the cumulative reward over a finite horizon subject to knapsack constraints on the resources. Existing works study the BwK problem under either a stochastic or an adversarial environment. Our paper considers a non-stationary environment which continuously interpolates between these two extremes. We first show that, due to the presence of the constraints, the traditional notion of variation budget is insufficient to characterize the non-stationarity of the BwK problem in a way that permits sublinear regret, and we then propose a new global non-stationarity measure. We employ both non-stationarity measures to derive upper and lower bounds for the problem. Our results are based on a primal-dual analysis of the underlying linear programs and highlight the interplay between the constraints and the non-stationarity. Finally, we also extend the non-stationarity measure to the problem of online convex optimization with constraints and obtain new regret bounds accordingly.
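For the reader's reference, the display below sketches the standard per-round fluid linear program that is commonly used as a benchmark in the BwK literature; the notation ($K$ arms, $d$ resource types, horizon $T$, budgets $B_j$, expected reward $r_i$, expected consumption $c_{i,j}$, and per-round play rates $x_i$) is introduced here purely for illustration and is not taken from the paper itself.

% A minimal sketch of the per-round fluid LP benchmark for BwK.
% All symbols below are illustrative notation, not the paper's own.
\begin{align*}
  \max_{x \in \mathbb{R}^{K}_{\ge 0}} \quad & \sum_{i=1}^{K} r_i \, x_i \\
  \text{s.t.} \quad & \sum_{i=1}^{K} c_{i,j} \, x_i \le \frac{B_j}{T}, \qquad j = 1, \dots, d, \\
  & \sum_{i=1}^{K} x_i \le 1.
\end{align*}

In the stochastic setting, the optimal value of this LP (scaled by $T$) upper-bounds the expected reward of any policy, and its dual variables price the resources; this is the type of primal-dual structure the abstract refers to when analyzing how non-stationarity interacts with the constraints.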