This paper studies offline policy learning, which aims to use observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn the optimal individualized decision rule in a given class. Existing policy learning methods rely on a uniform overlap assumption: the propensities of exploring all actions for all individual characteristics must be lower bounded in the offline dataset. In other words, the performance of these methods depends on the worst-case propensity in the offline data. Since one has no control over the data collection process, this assumption can be unrealistic in many situations, especially when the behavior policies are allowed to evolve over time with diminishing propensities. In this paper, we propose a new algorithm that optimizes lower confidence bounds (LCBs), instead of point estimates, of the policy values. The LCBs are constructed by quantifying the estimation uncertainty of augmented inverse propensity weighted (AIPW)-type estimators using knowledge of the behavior policies that collected the offline data. Without assuming any uniform overlap condition, we establish a data-dependent upper bound on the suboptimality of our algorithm, which depends only on (i) the overlap for the optimal policy, and (ii) the complexity of the policy class. As an implication, for adaptively collected data, our method ensures efficient policy learning as long as the propensities for optimal actions are lower bounded over time, while those for suboptimal actions are allowed to diminish arbitrarily fast. In our theoretical analysis, we develop a new self-normalized concentration inequality for inverse propensity weighted estimators, generalizing the well-known empirical Bernstein inequality to unbounded and non-i.i.d. data.
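For concreteness, the following is a minimal sketch of the pessimistic objective in standard AIPW notation; the symbols $\hat{\Gamma}_i$, $\hat{Q}_n$, $\widehat{\mathrm{SE}}_n$, and the multiplier $\beta$ are illustrative placeholders rather than the paper's exact definitions. Given contexts $x_i$, actions $a_i$, rewards $r_i$, behavior propensities $e_i(\cdot \mid \cdot)$, and an outcome model $\hat{\mu}$, the AIPW score and value estimate for a policy $\pi$ in the class $\Pi$ are
\[
\hat{\Gamma}_i(\pi) \;=\; \hat{\mu}\bigl(x_i, \pi(x_i)\bigr) \;+\; \frac{\mathbf{1}\{a_i = \pi(x_i)\}}{e_i(a_i \mid x_i)}\Bigl(r_i - \hat{\mu}(x_i, a_i)\Bigr),
\qquad
\hat{Q}_n(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n} \hat{\Gamma}_i(\pi),
\]
and the algorithm returns a policy maximizing a lower confidence bound rather than the point estimate,
\[
\hat{\pi} \;\in\; \operatorname*{arg\,max}_{\pi \in \Pi} \Bigl\{ \hat{Q}_n(\pi) \;-\; \beta \,\widehat{\mathrm{SE}}_n(\pi) \Bigr\},
\]
where $\widehat{\mathrm{SE}}_n(\pi)$ quantifies the estimation uncertainty of $\hat{Q}_n(\pi)$. Because the uncertainty penalty is policy-dependent, the resulting suboptimality guarantee scales with the overlap of the optimal policy alone rather than with the worst-case propensity across all actions.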