Inference on Optimal Dynamic Policies via Softmax Approximation

Estimating optimal dynamic policies from offline data is a fundamental problem in dynamic decision making. In the context of causal inference, the problem is known as estimating the optimal dynamic treatment regime. Even though a plethora of estimation methods exists, constructing confidence intervals for the value of the optimal regime and for the structural parameters associated with it is inherently harder, as it involves non-linear and non-differentiable functionals of unknown quantities that need to be estimated. Prior work has resorted to subsampling approaches, which can degrade the quality of the estimate. We show that a simple softmax approximation to the optimal treatment regime, for an appropriately fast-growing temperature parameter, can achieve valid inference on the truly optimal regime. We illustrate our result for a two-period optimal dynamic regime, though our approach should directly extend to the finite-horizon case. Our work combines techniques from semi-parametric inference and $g$-estimation, together with an appropriate triangular array central limit theorem, as well as a novel analysis of the asymptotic influence and asymptotic bias of softmax approximations.

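
To give some intuition for why a softmax with a growing temperature parameter can stand in for the non-differentiable maximum, the following is a minimal numerical sketch. It is not the paper's estimator: the function names `softmax_weights` and `soft_value`, the scaling parameter `beta`, and the two candidate treatment values are all hypothetical and chosen only for illustration. The sketch assumes the softmax policy weights each treatment proportionally to `exp(beta * value)` and compares the resulting smooth value with the hard maximum.

```python
import numpy as np


def softmax_weights(q, beta):
    """Softmax weights over candidate values q, scaled by beta (hypothetical name)."""
    q = np.asarray(q, dtype=float)
    # Subtract the maximum before exponentiating for numerical stability.
    z = beta * (q - np.max(q))
    w = np.exp(z)
    return w / w.sum()


def soft_value(q, beta):
    """Softmax-weighted average of q: a smooth surrogate for max(q)."""
    q = np.asarray(q, dtype=float)
    return float(softmax_weights(q, beta) @ q)


# Two illustrative treatment values (assumed numbers, not from the paper).
q = np.array([1.0, 1.3])

# Gap between the hard maximum and its smooth surrogate for growing beta.
gaps = {beta: q.max() - soft_value(q, beta) for beta in (1.0, 10.0, 100.0)}

for beta, gap in gaps.items():
    print(f"beta={beta:6.1f}  max - soft_value = {gap:.3e}")
```

For two treatments with value gap `d`, the surrogate falls short of the maximum by `d / (1 + exp(beta * d))`, so the approximation bias in this toy setting vanishes quickly as `beta` grows while the surrogate stays differentiable in the values for any finite `beta`.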