We consider the "policy choice" problem -- also known as best arm identification in the bandit literature -- proposed by Kasy and Sautmann (2021) for adaptive experimental design. Theorem 1 of Kasy and Sautmann (2021) provides three asymptotic results that give theoretical guarantees for the exploration sampling procedure developed for this setting. We first show that the proof of Theorem 1 (1) has technical issues, and that the proof and statement of Theorem 1 (2) are incorrect. We then show, through a counterexample, that Theorem 1 (3) is false. For the first two results, we provide corrected statements and rigorous proofs. For Theorem 1 (3), we propose an alternative objective function, which we call posterior weighted policy regret, and derive its asymptotic optimality.