In stochastic contextual bandit problems, an agent sequentially takes actions from a time-dependent action set based on past experience so as to minimize the cumulative regret. Like many other machine learning algorithms, the performance of bandit algorithms depends heavily on their multiple hyperparameters, and theoretically derived parameter values may lead to unsatisfactory results in practice. Moreover, it is infeasible to use offline tuning methods such as cross-validation to choose hyperparameters in the bandit setting, since decisions must be made in real time. To address this challenge, we propose the first online continuous hyperparameter tuning framework for contextual bandits, which learns the optimal parameter configuration within a search space on the fly. Specifically, we use a double-layer bandit framework named CDT (Continuous Dynamic Tuning) and formulate hyperparameter optimization as a non-stationary continuum-armed bandit, where each arm represents a combination of hyperparameters and the corresponding reward is the algorithm's performance under that configuration. For the top layer, we propose the Zooming TS algorithm, which uses Thompson Sampling (TS) for exploration and a restart technique to handle the switching environment. The proposed CDT framework can be readily applied to tune contextual bandit algorithms without any pre-specified candidate set of hyperparameters. We further show that it achieves sublinear regret in theory and consistently outperforms existing methods on both synthetic and real datasets in practice.
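To make the double-layer idea concrete, the following is a minimal, illustrative sketch of the top-layer component only; it is not the paper's reference implementation. The class name ZoomingTS, the coarse-grid initialization, the Gaussian posteriors, the refinement rule, and the helper run_bandit_round are all simplifying assumptions introduced here. The sketch only shows how the three ingredients named in the abstract fit together: TS exploration over active arms, adaptive zooming on a continuous hyperparameter interval, and periodic restarts for a switching environment.

```python
# Illustrative sketch (hypothetical names throughout), assuming a 1-D
# hyperparameter interval [lo, hi] and Gaussian Thompson Sampling.
import numpy as np

class ZoomingTS:
    """Thompson Sampling over an adaptively refined arm set on [lo, hi]."""
    def __init__(self, lo, hi, epoch_length=500, seed=0):
        self.lo, self.hi = lo, hi
        self.epoch_length = epoch_length   # restart period for non-stationarity
        self.rng = np.random.default_rng(seed)
        self._restart()

    def _restart(self):
        # Each epoch starts from a coarse grid of active arms.
        self.arms = list(np.linspace(self.lo, self.hi, 5))
        self.n = np.ones(len(self.arms))       # pseudo-counts per arm
        self.mean = np.zeros(len(self.arms))   # empirical mean reward per arm
        self.t = 0

    def select(self):
        if self.t > 0 and self.t % self.epoch_length == 0:
            self._restart()                    # restart technique
        # One Gaussian posterior sample per active arm; the sampling noise
        # shrinks as an arm accumulates pulls.
        samples = self.mean + self.rng.normal(0, 1.0 / np.sqrt(self.n))
        self.idx = int(np.argmax(samples))
        return self.arms[self.idx]

    def update(self, reward):
        i = self.idx
        self.mean[i] += (reward - self.mean[i]) / (self.n[i] + 1)
        self.n[i] += 1
        self.t += 1
        # Zooming step: once an arm is well estimated, activate two children
        # near it so the discretization refines around promising regions.
        radius = 1.0 / np.sqrt(self.n[i])      # confidence radius proxy
        if radius < 0.1 and len(self.arms) < 50:
            width = (self.hi - self.lo) / len(self.arms)
            for child in (self.arms[i] - width / 2, self.arms[i] + width / 2):
                if self.lo <= child <= self.hi and child not in self.arms:
                    self.arms.append(child)
                    self.n = np.append(self.n, 1.0)
                    self.mean = np.append(self.mean, self.mean[i])

def run_bandit_round(alpha, rng=np.random.default_rng(1)):
    # Stand-in for the bottom layer: one round of the underlying contextual
    # bandit run with hyperparameter alpha (e.g., an exploration rate).
    # Here just a toy noisy reward peaked at alpha = 0.8.
    return -abs(alpha - 0.8) + rng.normal(0, 0.1)

# Hypothetical usage: the top layer tunes alpha while the bandit runs.
tuner = ZoomingTS(lo=0.0, hi=2.0)
for t in range(2000):
    alpha = tuner.select()            # top layer proposes a configuration
    reward = run_bandit_round(alpha)  # bottom layer returns realized reward
    tuner.update(reward)
```

In restart-based methods of this kind, the epoch length trades off adaptivity to environment switches against the cost of repeated re-exploration; the constants above (grid size, radius threshold, epoch length) are placeholders rather than the theoretically tuned values from the paper's analysis.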