Multi-armed bandits (MAB) are extensively studied in various settings where the objective is to \textit{maximize} the actions' outcomes (i.e., rewards) over time. Since safety is crucial in many real-world problems, safe versions of MAB algorithms have also garnered considerable interest. In this work, we tackle a different critical task through the lens of \textit{linear stochastic bandits}, where the aim is to keep the actions' outcomes close to a target level while respecting a \textit{two-sided} safety constraint, a task we call \textit{leveling}. Leveling is prevalent in numerous domains; many healthcare problems, for instance, require keeping a physiological variable within a range and preferably close to a target level. This radical change in objective necessitates a new acquisition strategy, which lies at the heart of any MAB algorithm. We propose SALE-LTS: Safe Leveling via Linear Thompson Sampling, an algorithm with a novel acquisition strategy tailored to the leveling task, and show that it achieves sublinear regret with the same time and dimension dependence as prior work on the classical reward maximization problem without any safety constraints. We demonstrate and discuss our algorithm's empirical performance in detail through extensive experiments.
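To make the acquisition idea concrete, the following is a minimal Python sketch of a leveling-style linear Thompson Sampling loop. It is an illustration under stated assumptions, not the exact SALE-LTS procedure from the paper: the Gaussian posterior form, the safety interval \texttt{[lo, hi]}, the finite action set, and all names (\texttt{target}, \texttt{noise\_var}, etc.) are hypothetical choices for exposition. The only substantive departure from standard linear Thompson Sampling is step 3, where the action whose sampled outcome is \textit{closest to the target} is selected instead of the one with the largest outcome.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

d = 3                       # feature dimension
target = 1.0                # desired outcome level (assumed)
lo, hi = 0.5, 1.5           # two-sided safety interval (assumed form)
lam, noise_var = 1.0, 0.1   # ridge prior strength, noise variance (assumed)

theta_true = np.array([0.4, 0.3, 0.5])     # unknown parameter (toy example)
actions = rng.uniform(0, 1, size=(20, d))  # hypothetical finite action set

V = lam * np.eye(d)   # posterior precision matrix
b = np.zeros(d)       # running sum of x_t * y_t

for t in range(50):
    # 1. Sample a parameter vector from the Gaussian posterior.
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b
    theta_s = rng.multivariate_normal(theta_hat, noise_var * V_inv)

    # 2. Predict outcomes and keep actions whose sampled outcome
    #    respects the two-sided safety interval.
    preds = actions @ theta_s
    safe = (preds >= lo) & (preds <= hi)
    if not safe.any():
        safe[:] = True  # crude fallback; the paper treats safety more carefully

    # 3. Leveling acquisition: choose the safe action whose predicted
    #    outcome is closest to the target level (not the largest one).
    idx = np.flatnonzero(safe)
    i = idx[np.argmin(np.abs(preds[idx] - target))]

    # 4. Observe a noisy outcome and update the posterior statistics.
    y = actions[i] @ theta_true + rng.normal(0.0, np.sqrt(noise_var))
    V += np.outer(actions[i], actions[i])
    b += y * actions[i]
\end{verbatim}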