We design and implement an adaptive experiment (a ``contextual bandit'') to learn a targeted treatment assignment policy, where the goal is to use a participant's survey responses to determine which charity to expose them to in a donation solicitation. The design balances two competing objectives: optimizing the outcomes for the subjects in the experiment (``cumulative regret minimization'') and gathering data that will be most useful for policy learning, that is, for learning an assignment rule that will maximize welfare if used after the experiment (``simple regret minimization''). We evaluate alternative experimental designs by collecting pilot data and then conducting a simulation study. Next, we implement our selected algorithm. Finally, we perform a second simulation study anchored to the collected data that evaluates the benefits of the algorithm we chose. Our first result is that the value of a learned policy in this setting is higher when data is collected via uniform randomization rather than adaptively using standard cumulative regret minimization or policy learning algorithms. We propose a simple heuristic for adaptive experimentation that improves upon uniform randomization from the perspective of policy learning, at the expense of increasing cumulative regret relative to alternative bandit algorithms. The heuristic modifies an existing contextual bandit algorithm by (i) imposing a lower bound on assignment probabilities that decays slowly, so that no arm is discarded too quickly, and (ii) after adaptively collecting data, restricting policy learning to select from arms for which sufficient data has been gathered.
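To make the two modifications concrete, a minimal Python sketch follows. The particular floor schedule, its constants, and the sample-size threshold are illustrative assumptions for exposition, not the exact specification used in our experiment.

\begin{verbatim}
import numpy as np

def apply_probability_floor(raw_probs, t, c=0.2, alpha=0.5):
    """Modification (i): mix the base bandit's assignment
    probabilities with a uniform floor that decays like t**(-alpha),
    so no arm's probability collapses to zero early on. The schedule
    c / (k * t**alpha) and the constants c, alpha are illustrative
    assumptions, not the paper's exact choices."""
    k = len(raw_probs)
    floor = min(c / (k * t ** alpha), 1.0 / k)  # per-arm lower bound
    # Convex mixture with the uniform distribution keeps a valid
    # probability vector while guaranteeing every p_i >= floor.
    return floor * np.ones(k) + (1.0 - k * floor) * np.asarray(raw_probs)

def eligible_arms(assignment_counts, min_obs=100):
    """Modification (ii): after the adaptive phase, restrict the
    policy learner to arms observed at least min_obs times
    (min_obs is a hypothetical threshold)."""
    return np.flatnonzero(np.asarray(assignment_counts) >= min_obs)
\end{verbatim}

Because the floor decays only slowly in $t$, every arm retains a nonvanishing assignment probability throughout the experiment, which preserves the exploration needed for policy learning; the post-hoc eligibility filter then prevents the learned policy from selecting arms whose value estimates rest on too few observations.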