从特点中学习的MDP:通过强化学习对序列决定问题的预测-随后-优化 (Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Problems by Reinforcement Learning)

In the predict-then-optimize framework, the objective is to train a predictive model, mapping from environment features to parameters of an optimization problem, which maximizes decision quality when the optimization is subsequently solved. Recent work on decision-focused learning shows that embedding the optimization problem in the training pipeline can improve decision quality and help generalize better to unseen tasks compared to relying on an intermediate loss function for evaluating prediction quality. We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) that are solved via reinforcement learning. In particular, we are given environment features and a set of trajectories from training MDPs, which we use to train a predictive model that generalizes to unseen test MDPs without trajectories. Two significant computational challenges arise in applying decision-focused learning to MDPs: (i) large state and action spaces make it infeasible for existing techniques to differentiate through MDP problems, and (ii) the high-dimensional policy space, as parameterized by a neural network, makes differentiating through a policy expensive. We resolve the first challenge by sampling provably unbiased derivatives to approximate and differentiate through optimality conditions, and the second challenge by using a low-rank approximation to the high-dimensional sample-based derivatives. We implement both Bellman--based and policy gradient--based decision-focused learning on three different MDP problems with missing parameters, and show that decision-focused learning performs better in generalization to unseen tasks.

翻译：在预测-最佳化框架内,目标是从环境特征到优化问题参数,对优化问题进行预测模型,从环境特征到优化参数进行绘图,在随后解决优化时最大限度地提高决策质量。最近关于注重决策的学习工作表明,将优化问题嵌入培训管道,可以提高决策质量,并有助于比依赖中期损失功能来评价预测质量,更好地概括到不可见的任务。我们研究在通过强化学习解决的顺序决策问题(以 MDP 形式形成)背景下的预测-最佳框架。特别是,我们从培训中获得了环境特征和一套从培训中最大限度地提高决策质量的轨迹。我们用这种方式来培训一个预测模型,将秘密测试 MDP 纳入培训管道中,这样可以提高决策质量,而不用轨迹。在将决策重点和深度分析中,我们通过抽样分析,通过高层次分析,将高层次政策评估与高层次决策的难度与高层次分析加以区分。我们通过高层次分析,通过高层次分析,通过高层次分析,通过高层次的模版图来解决高层次政策挑战。