We study the offline, data-driven sequential decision-making problem in the framework of Markov decision processes (MDPs). To enhance the generalizability and adaptivity of the learned policy, we propose evaluating each policy by a set of average rewards with respect to distributions centered at the policy-induced stationary distribution. Given a pre-collected dataset of multiple trajectories generated by some behavior policy, our goal is to learn a robust policy, within a pre-specified policy class, that maximizes the smallest value in this set. Leveraging the theory of semi-parametric statistics, we develop a statistically efficient policy learning method for estimating the defined robust optimal policy. A regret bound that is rate-optimal up to a logarithmic factor is established in terms of the total number of decision points in the dataset.
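As a concrete reading of this objective (a sketch with assumed notation, not taken verbatim from the paper): for a policy $\pi$ in the pre-specified class $\Pi$ with induced stationary distribution $d^\pi$ and reward function $r$, the robust value and the robust optimal policy could be written as
\[
V_{\mathrm{rob}}(\pi) \;=\; \min_{q \in \mathcal{U}_{\varepsilon}(d^\pi)} \mathbb{E}_{s \sim q,\, a \sim \pi(\cdot \mid s)}\bigl[r(s,a)\bigr],
\qquad
\pi^{\ast} \;=\; \operatorname*{arg\,max}_{\pi \in \Pi} \, V_{\mathrm{rob}}(\pi),
\]
where $\mathcal{U}_{\varepsilon}(d^\pi)$ denotes a set of distributions centered at $d^\pi$ (for instance, a divergence ball of radius $\varepsilon$); the specific form of this set is an assumption here.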