We consider estimation and inference with data collected from episodic reinforcement learning (RL) algorithms, i.e., adaptive experimentation algorithms that at each period (aka episode) interact multiple times in a sequential manner with a single treated unit. Our goal is to evaluate counterfactual adaptive policies after data collection and to estimate structural parameters, such as dynamic treatment effects, that can be used for credit assignment (e.g., the effect of the first-period action on the final outcome). Such parameters of interest can be framed as solutions to moment equations, but not as minimizers of a population loss function, leading to Z-estimation approaches in the case of static data. However, such estimators fail to be asymptotically normal under adaptive data collection. We propose a re-weighted Z-estimation approach with carefully designed adaptive weights that stabilize the episode-varying estimation variance arising from the nonstationary policies that typical episodic RL algorithms invoke. We identify proper weighting schemes that restore the consistency and asymptotic normality of the re-weighted Z-estimators for target parameters, which enables hypothesis testing and the construction of reliable confidence regions for the parameters of interest. Primary applications include dynamic treatment effect estimation and dynamic off-policy evaluation.
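A minimal simulation may help fix ideas. The sketch below is purely illustrative and not the paper's estimator: it simulates a single-parameter treatment effect under a drifting (nonstationary) assignment policy, estimates it by solving a weighted moment equation, and uses a simple variance-stabilizing weight proportional to the square root of the assignment probability as a rough stand-in for the carefully designed adaptive weights described above. All quantities (`theta_true`, the drift pattern, the weight choice) are assumptions made for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate adaptively collected data: the assignment probability p_t
# drifts across periods, a stand-in for a nonstationary RL policy.
T = 5000
theta_true = 2.0  # true treatment effect (illustrative)
p = np.clip(0.5 + 0.4 * np.sin(np.linspace(0, 3, T)), 0.05, 0.95)
a = rng.binomial(1, p)               # adaptive treatment assignment
y = theta_true * a + rng.normal(size=T)

# Moment function for an IPW-style Z-estimate of theta:
#   m(z_t; theta) = a_t * y_t / p_t - theta, which has mean zero at
#   the true theta, so the estimator solves sum_t w_t * m(z_t; theta) = 0.
scores = a * y / p

# Variance-stabilizing adaptive weights: sqrt(p_t) is a crude proxy for
# the inverse conditional standard deviation of the score, which grows
# as p_t shrinks.
w = np.sqrt(p)
theta_hat = np.sum(w * scores) / np.sum(w)  # closed-form root of the
                                            # weighted moment equation
```

Without re-weighting (i.e., `w = 1`), the estimator remains unbiased here but inherits the period-varying variance of the scores; the weighted version downweights high-variance periods, mimicking the stabilization role the adaptive weights play in the re-weighted Z-estimation approach.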