We consider the problem of decision-making under uncertainty in an environment with safety constraints. Many business and industrial applications rely on real-time optimization with changing inputs to improve key performance indicators. When the characteristics of the environment are unknown, real-time optimization becomes challenging, particularly with respect to satisfying safety constraints. We propose the ARTEO algorithm, in which we cast multi-armed bandits as a mathematical programming problem subject to safety constraints and learn the environmental characteristics through changes in optimization inputs and through exploration. We quantify the uncertainty in the unknown characteristics using Gaussian processes and incorporate it into the utility function as a contribution that drives exploration. We adaptively control the size of this contribution using a heuristic, in accordance with the requirements of the environment. We guarantee the safety of our algorithm with high probability through confidence bounds constructed under the regularity assumptions of Gaussian processes. Compared to existing safe-learning approaches, our algorithm does not require an exclusive exploration phase and follows the optimization goals even at the explored points, which makes it suitable for safety-critical systems. We demonstrate the safety and efficiency of our approach with two experiments: an industrial process and an online bid optimization benchmark problem.
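The idea of combining a Gaussian-process uncertainty term in the objective with a confidence-bound safety check can be illustrated with a minimal sketch. This is not the ARTEO implementation: the RBF kernel, the discrete candidate grid, the known `cost` function, and the parameters `beta` (confidence-bound width) and `z` (weight of the exploration contribution) are all assumptions chosen for illustration, and `z` is held fixed here rather than adapted by a heuristic.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    mean = k_qt @ np.linalg.solve(k_tt, y_train)
    solved = np.linalg.solve(k_tt, k_qt.T)
    prior_var = rbf_kernel(x_query, x_query).diagonal()
    var = prior_var - np.sum(k_qt * solved.T, axis=1)
    # Clip tiny negative values caused by round-off before the square root.
    return mean, np.sqrt(np.clip(var, 0.0, None))

def select_safe_input(x_train, y_train, candidates, cost, limit,
                      beta=2.0, z=1.0):
    """Pick the candidate minimizing cost minus an uncertainty bonus,
    restricted to candidates whose upper confidence bound on the
    unknown safety quantity stays at or below `limit`."""
    mean, std = gp_posterior(x_train, y_train, candidates)
    safe = mean + beta * std <= limit          # high-probability safety check
    if not safe.any():
        raise ValueError("no candidate is certified safe")
    objective = cost(candidates) - z * std     # uncertainty drives exploration
    objective = np.where(safe, objective, np.inf)
    return candidates[np.argmin(objective)]

if __name__ == "__main__":
    # Hypothetical setup: the safety quantity g(x) = x**2 is observed at two inputs.
    x_obs = np.array([0.0, 0.5])
    y_obs = x_obs ** 2
    grid = np.linspace(0.0, 2.0, 41)
    chosen = select_safe_input(x_obs, y_obs, grid,
                               cost=lambda x: (x - 2.0) ** 2, limit=1.0)
    print("chosen input:", chosen)
```

In this sketch the optimizer keeps pursuing the cost objective at every step, and exploration enters only through the `z * std` term, so no separate exploration phase is needed; candidates far from the observed data are excluded because their upper confidence bound exceeds the limit.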