Bandit Convex Optimization is a fundamental class of sequential decision-making problems, where the learner selects actions from a continuous domain and observes a loss (but not its gradient) at only one point per round. We study this problem in non-stationary environments, and aim to minimize the regret under three standard measures of non-stationarity: the number of switches $S$ in the comparator sequence, the total variation $Δ$ of the loss functions, and the path-length $P$ of the comparator sequence. We propose a polynomial-time algorithm, Tilted Exponentially Weighted Average with Sleeping Experts (TEWA-SE), which adapts the sleeping experts framework from online convex optimization to the bandit setting. For strongly convex losses, we prove that TEWA-SE is minimax-optimal with respect to known $S$ and $Δ$ by establishing matching upper and lower bounds. By equipping TEWA-SE with the Bandit-over-Bandit framework, we extend our analysis to environments with unknown non-stationarity measures. For general convex losses, we introduce a second algorithm, clipped Exploration by Optimization (cExO), based on exponential weights over a discretized action space. While not polynomial-time computable, this method achieves minimax-optimal regret with respect to known $S$ and $Δ$, and improves on the best existing bounds with respect to $P$.
翻译:老虎机凸优化是一类基础的序列决策问题,其中学习者在连续域中选择动作,并在每轮仅观测单点损失(而非其梯度)。本文研究非稳态环境下的该问题,旨在基于三种标准非稳态度量最小化遗憾:比较器序列的切换次数 $S$、损失函数的总变差 $Δ$,以及比较器序列的路径长度 $P$。我们提出一种多项式时间算法——基于休眠专家的倾斜指数加权平均法(TEWA-SE),该方法将在线凸优化中的休眠专家框架适配至老虎机设定。针对强凸损失函数,我们通过建立匹配的上下界证明:在已知 $S$ 和 $Δ$ 时,TEWA-SE 具有极小极大最优性。通过为 TEWA-SE 配备老虎机套娃框架,我们将分析扩展至非稳态度量未知的环境。对于一般凸损失函数,我们提出第二种算法——基于离散化动作空间指数加权的截断优化探索法(cExO)。虽然该方法不具备多项式时间可计算性,但在已知 $S$ 和 $Δ$ 时实现了极小极大最优遗憾,并在 $P$ 度量上改进了现有最佳边界。