We consider the "policy choice" problem -- otherwise known as best arm identification in the bandit literature -- proposed by Kasy and Sautmann (2021) for adaptive experimental design. Theorem 1 of Kasy and Sautmann (2021) provides three asymptotic results that give theoretical guarantees for exploration sampling, the algorithm developed for this setting. We first show that the proof of Theorem 1 (1) has technical issues and that the proof and statement of Theorem 1 (2) are incorrect. We then show, through a counterexample, that Theorem 1 (3) is false. For the former two results, we correct the statements and provide rigorous proofs. For Theorem 1 (3), we propose an alternative objective function, which we call posterior weighted policy regret, and establish the asymptotic optimality of exploration sampling with respect to it.