We consider a varying-horizon Markov decision process (MDP), in which each policy is evaluated by a set of average rewards computed over different horizon lengths and different reference distributions. Given a pre-collected dataset of multiple trajectories generated by some behavior policy, our goal is to learn a robust policy, within a pre-specified policy class, that approximately maximizes the smallest value in this set. Leveraging semi-parametric statistics, we develop an efficient policy learning method for estimating the defined robust optimal policy that breaks the curse of horizon. A regret bound that is rate-optimal up to a logarithmic factor is established in terms of the number of trajectories and the number of decision points. Our regret guarantee subsumes the long-term average-reward MDP setting as a special case.
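As a rough formalization of the robust objective described above (a sketch only; the notation is not taken from the abstract: $\Pi$ denotes the pre-specified policy class, $\mathcal{T}$ the set of horizon lengths, $\mathcal{G}$ the set of reference distributions, and $V_{T,\nu}(\pi)$ the average reward of policy $\pi$ over horizon $T$ when the initial state is drawn from $\nu$), the learning target can be read as the max-min problem
\[
\pi^{*} \in \operatorname*{arg\,max}_{\pi \in \Pi} \; \min_{T \in \mathcal{T},\, \nu \in \mathcal{G}} V_{T,\nu}(\pi),
\qquad
V_{T,\nu}(\pi) = \mathbb{E}_{s_0 \sim \nu,\, a_t \sim \pi}\!\left[\frac{1}{T}\sum_{t=0}^{T-1} r_t\right],
\]
so that maximizing the smallest element of the evaluation set corresponds to the inner minimum over horizon lengths and reference distributions.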