We develop a reinforcement learning (RL) framework for applications that involve sequential decisions and exogenous uncertainty, such as resource allocation and inventory management. In these applications, the uncertainty is due only to exogenous variables like future demands. A popular approach is to predict the exogenous variables using historical data and then plan with the predictions. However, this indirect approach requires high-fidelity modeling of the exogenous process to guarantee good downstream decision-making, which can be impractical when the exogenous process is complex. In this work, we propose an alternative approach based on hindsight learning that sidesteps modeling the exogenous process. Our key insight is that, unlike in Sim2Real RL, in these applications we can revisit past decisions in the historical data and derive the counterfactual consequences of other actions. Our framework uses hindsight-optimal actions as the policy training signal and has strong theoretical guarantees on decision-making performance. Using the framework, we develop an algorithm to allocate compute resources for real-world Microsoft Azure workloads. The results show that our approach learns better policies than domain-specific heuristics and Sim2Real RL baselines.
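To make the training-signal idea concrete, here is a minimal sketch of hindsight learning on a toy multi-period inventory problem. Everything in it is an illustrative assumption, not the paper's actual setup or algorithm: the horizon, cost constants, demand range, and the tabular softmax policy are invented for the example. The point it demonstrates is that, because all uncertainty is exogenous, each logged demand trace can be re-solved exactly in hindsight (here by dynamic programming), and the resulting hindsight-optimal actions then supervise the policy through a plain imitation-style cross-entropy loss, with no model of the demand process ever being fit.

```python
# Illustrative sketch of hindsight learning on a toy inventory problem.
# All constants and helper names here are assumptions for the example.
import numpy as np

rng = np.random.default_rng(0)
H, MAX_INV, MAX_ORDER = 5, 10, 5      # horizon, inventory cap, order cap
HOLD_COST, STOCKOUT_COST = 1.0, 4.0   # per-unit holding / stockout costs

def step_cost(inv, order, demand):
    """One-period cost and next inventory for a realized demand."""
    stock = min(inv + order, MAX_INV)
    next_inv = max(stock - demand, 0)
    cost = HOLD_COST * next_inv + STOCKOUT_COST * max(demand - stock, 0)
    return cost, next_inv

def hindsight_optimal_actions(demands, inv0=0):
    """Dynamic program over a fully known demand trace.

    Because the uncertainty is exogenous, revisiting the trace lets us
    solve for the optimal action at every reachable (time, inventory)
    state and read off counterfactual consequences of other actions.
    """
    V = np.zeros((H + 1, MAX_INV + 1))
    best = np.zeros((H, MAX_INV + 1), dtype=int)
    for t in range(H - 1, -1, -1):
        for inv in range(MAX_INV + 1):
            costs = []
            for a in range(MAX_ORDER + 1):
                c, nxt = step_cost(inv, a, demands[t])
                costs.append(c + V[t + 1, nxt])
            best[t, inv] = int(np.argmin(costs))
            V[t, inv] = min(costs)
    # Roll forward to collect (state, hindsight-optimal action) pairs.
    pairs, inv = [], inv0
    for t in range(H):
        a = best[t, inv]
        pairs.append(((t, inv), a))
        _, inv = step_cost(inv, a, demands[t])
    return pairs

# Historical exogenous traces (e.g., past demand sequences).
traces = rng.integers(0, 6, size=(200, H))
data = [p for tr in traces for p in hindsight_optimal_actions(tr)]

# Train a softmax policy on hindsight-optimal actions (imitation loss).
W = np.zeros((H * (MAX_INV + 1), MAX_ORDER + 1))   # tabular features
def feat(t, inv):
    return t * (MAX_INV + 1) + inv

for _ in range(50):                                 # plain gradient descent
    for (t, inv), a in data:
        logits = W[feat(t, inv)]
        p = np.exp(logits - logits.max()); p /= p.sum()
        grad = p.copy(); grad[a] -= 1.0             # cross-entropy gradient
        W[feat(t, inv)] -= 0.1 * grad

policy = W.argmax(axis=1)                           # greedy learned policy
```

The tabular policy stands in for any function approximator; what matters in the sketch is only that the supervision labels come from hindsight re-optimization of logged exogenous traces rather than from a learned simulator of the exogenous process.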