We consider a general online stochastic optimization problem with multiple budget constraints over a finite time horizon. In each time period, a reward function and multiple cost functions are revealed, and the decision maker needs to specify an action from a convex and compact action set to collect the reward and consume the budgets. Each cost function corresponds to the consumption of one budget. The reward and cost functions in each period are drawn from an unknown distribution, which is non-stationary across time. The objective of the decision maker is to maximize the cumulative reward subject to the budget constraints. This formulation captures a wide range of applications, including online linear programming and network revenue management, among others. In this paper, we consider two settings: (i) a data-driven setting where the true distribution is unknown but a prior estimate (possibly inaccurate) is available; and (ii) an uninformative setting where the true distribution is completely unknown. We propose a unified Wasserstein-distance-based measure that quantifies the inaccuracy of the prior estimate in setting (i) and the non-stationarity of the system in setting (ii). We show that the proposed measure leads to a necessary and sufficient condition for the attainability of sublinear regret in both settings. For setting (i), we propose a new algorithm that takes a primal-dual perspective and integrates the prior information on the underlying distributions into an online gradient descent procedure in the dual space. The algorithm extends naturally to the uninformative setting (ii). Under both settings, we show that the corresponding algorithms achieve regret of the optimal order. In numerical experiments, we demonstrate how the proposed algorithms can be naturally integrated with the re-solving technique to further boost the empirical performance.
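For concreteness, one natural instantiation of such a measure (a hedged reading of the abstract; the paper's exact definition may differ) accumulates per-period Wasserstein distances between the true distributions $\mathcal{P}_t$ and reference distributions $\hat{\mathcal{P}}_t$, where $\hat{\mathcal{P}}_t$ is the prior estimate in setting (i) or a best stationary proxy in setting (ii):
$$
W_T \;=\; \sum_{t=1}^{T} \mathcal{W}\!\left(\mathcal{P}_t,\, \hat{\mathcal{P}}_t\right),
$$
with the abstract's necessary-and-sufficient condition then reading, informally, that sublinear regret is attainable if and only if this measure grows sublinearly, i.e., $W_T = o(T)$.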
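To illustrate the dual-space online gradient descent template the abstract alludes to, here is a minimal sketch under strong simplifying assumptions: linear rewards and costs, a one-dimensional action $x_t \in [0,1]$, and a fixed step size. The function name `dual_ogd` and the step-size choice are our own illustrative conventions, and the sketch deliberately omits the paper's integration of the prior distribution estimate; it is not the authors' exact algorithm.

```python
import numpy as np

def dual_ogd(rewards, costs, budgets, eta):
    """Sketch of primal-dual online gradient descent (illustrative only).

    rewards: (T,)   per-period reward coefficients r_t, action x_t in [0, 1]
    costs:   (T, m) per-period cost vectors a_t, one column per budget
    budgets: (m,)   total budgets B
    eta:     dual step size, e.g. O(1/sqrt(T))
    """
    T, m = costs.shape
    rho = budgets / T                  # per-period budget rate
    mu = np.zeros(m)                   # dual prices, one per constraint
    remaining = budgets.astype(float)  # budgets left to spend
    total_reward = 0.0
    for t in range(T):
        r_t, a_t = rewards[t], costs[t]
        # Primal step: maximize the dual-adjusted reward over x in [0, 1].
        x_t = 1.0 if r_t - mu @ a_t > 0 else 0.0
        # Enforce the hard budget constraints.
        if np.any(a_t * x_t > remaining):
            x_t = 0.0
        remaining -= a_t * x_t
        total_reward += r_t * x_t
        # Dual step: projected gradient descent on the dual variables,
        # using the per-period subgradient rho - (consumption at x_t).
        mu = np.maximum(mu - eta * (rho - a_t * x_t), 0.0)
    return total_reward

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, m = 1000, 2
    print(dual_ogd(rng.uniform(size=T), rng.uniform(size=(T, m)),
                   budgets=np.full(m, 250.0), eta=1 / np.sqrt(T)))
```

The dual prices `mu` act as shadow prices on the budgets: actions are taken only when the reward exceeds the priced-out cost, and the projected update raises the price of any resource being consumed faster than its per-period rate `rho`.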