We study the problems of offline and online contextual optimization with feedback information, where instead of observing the loss, we observe, after the fact, the optimal action that an oracle with full knowledge of the objective function would have taken. We aim to minimize regret, defined as the difference between our losses and those incurred by an all-knowing oracle. In the offline setting, the decision-maker has information from past periods available and needs to make a single decision, while in the online setting, the decision-maker optimizes decisions dynamically over time based on a new set of feasible actions and contextual functions in each period. For the offline setting, we characterize the optimal minimax policy, establishing the performance that can be achieved as a function of the underlying geometry of the information induced by the data. For the online setting, we leverage this geometric characterization to optimize the cumulative regret. We develop an algorithm that yields the first regret bound for this problem that is logarithmic in the time horizon.
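For concreteness, the regret notion described above can be written as follows; the symbols $x_t$, $x_t^\star$, and $\ell_t$ are illustrative notation of our own, not necessarily the paper's, with $x_t$ the decision at period $t$, $x_t^\star$ the action the all-knowing oracle would take, and $\ell_t$ the period-$t$ loss:
\[
  \mathrm{Regret}_T \;=\; \sum_{t=1}^{T} \Bigl( \ell_t(x_t) - \ell_t(x_t^\star) \Bigr).
\]
Under this notation, the offline setting corresponds to a single decision informed by past data, while the online guarantee stated above is a bound of order $\log T$ on this cumulative sum.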