We consider an online revenue maximization problem over a finite time horizon subject to lower and upper bounds on cost. At each period, an agent receives a context vector sampled i.i.d. from an unknown distribution and must make a decision adaptively. The revenue and cost functions depend on the context vector as well as on a fixed but possibly unknown parameter vector that must be learned. We propose a novel offline benchmark and a new algorithm that combines an online dual mirror descent scheme with a generic parameter learning process. When the parameter vector is known, we demonstrate an $O(\sqrt{T})$ regret bound as well as an $O(\sqrt{T})$ bound on the possible constraint violations. When the parameter vector is not known and must be learned, we demonstrate that the regret and the constraint violations are each bounded by the sum of the previous $O(\sqrt{T})$ terms and terms that depend directly on the convergence of the learning process.
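To make the known-parameter setting concrete, the following is a minimal sketch of an online dual descent loop, not the algorithm analyzed in this work. It assumes a toy accept/reject decision, linear revenue and cost in the context, uniformly distributed contexts, and the Euclidean mirror map (so the mirror step reduces to a plain subgradient step). The function name `run_dual_descent` and all of its parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def run_dual_descent(T, lo, hi, theta_r, theta_c, seed=0):
    """Toy online dual descent with known parameters (illustrative only).

    The average per-period cost is meant to lie in [lo, hi].
    Returns (total revenue, total cost, final dual variable).
    """
    rng = random.Random(seed)
    eta = 1.0 / math.sqrt(T)  # step size of order 1/sqrt(T)
    mu = 0.0                  # one dual variable; its sign selects the priced bound
    revenue = 0.0
    cost = 0.0
    for _ in range(T):
        # Context drawn i.i.d.; the uniform distribution is an assumption.
        w = [rng.random() for _ in theta_r]
        r = sum(a * b for a, b in zip(theta_r, w))  # revenue if accepted
        c = sum(a * b for a, b in zip(theta_c, w))  # cost if accepted
        # Primal step: accept only if the dual-adjusted reward is positive.
        x = 1 if r - mu * c > 0 else 0
        revenue += x * r
        cost += x * c
        # Dual step: compare the incurred cost with the bound mu currently prices
        # (upper bound when mu >= 0, lower bound when mu < 0).
        target = hi if mu >= 0 else lo
        mu += eta * (x * c - target)
    return revenue, cost, mu


if __name__ == "__main__":
    rev, cost, mu = run_dual_descent(2000, 0.1, 0.3, [1.0, 0.5], [0.5, 1.0], seed=1)
    print(f"revenue={rev:.1f}  average cost={cost / 2000:.3f}  final mu={mu:.3f}")
```

In this sketch a positive dual variable raises the effective price of cost when spending runs above the upper bound, and a negative one rewards cost when spending runs below the lower bound. When the parameter vector is unknown, `theta_r` and `theta_c` would be replaced at each period by the current estimates from whatever learning process is used.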