Increasing Students' Engagement to Reminder Emails Through Multi-Armed Bandits

Conducting randomized experiments in educational settings raises the question of how machine learning techniques can be used to improve educational interventions. Using Multi-Armed Bandit (MAB) algorithms such as Thompson Sampling (TS) in adaptive experiments can increase students' chances of obtaining better outcomes by raising the probability of assignment to the optimal condition (arm), even before an intervention completes. This is an advantage over traditional A/B testing, which may allocate an equal number of students to both optimal and non-optimal conditions. The difficulty lies in the exploration-exploitation trade-off. Although adaptive policies aim to collect enough information to reliably allocate more students to better arms, past work shows that they may not explore enough to support reliable conclusions about whether arms differ. Hence, it is of interest to provide additional uniform random (UR) exploration throughout the experiment. This paper presents a real-world adaptive experiment on how students engage with instructors' weekly email reminders designed to build their time management habits. Our metric of interest is the email open rate, measured for arms represented by different subject lines. The emails are delivered following different allocation algorithms: UR, TS, and what we identify as TS†, which combines both TS and UR rewards to update its priors. We highlight problems with these adaptive algorithms, such as possible exploitation of an arm when there is no significant difference, and address their causes and consequences. Future directions include studying situations where the early choice of the optimal arm is not ideal and how adaptive algorithms can address them.
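To make the three allocation schemes concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes binary rewards (email opened or not), independent Beta(1, 1) priors per subject line, and a per-student coin flip with probability `ur_fraction` that decides between UR and TS assignment; the abstract specifies none of these details. All names (`BetaBernoulliArm`, `run_experiment`, `ur_fraction`) and the open rates in the usage note are hypothetical. Setting `ur_fraction=1.0` gives pure UR, `0.0` gives pure TS, and an intermediate value gives one possible reading of TS†, in which rewards from both UR and TS assignments update the same posterior.

```python
import random


class BetaBernoulliArm:
    """Posterior over one subject line's open rate (assumed Beta(1, 1) prior)."""

    def __init__(self):
        self.alpha = 1.0  # 1 + number of opened emails observed
        self.beta = 1.0   # 1 + number of unopened emails observed

    def update(self, opened):
        if opened:
            self.alpha += 1
        else:
            self.beta += 1

    def sample(self, rng):
        # Draw a plausible open rate from the current posterior.
        return rng.betavariate(self.alpha, self.beta)


def choose_ts(arms, rng):
    """Thompson Sampling: pick the arm whose posterior draw is largest."""
    draws = [arm.sample(rng) for arm in arms]
    return max(range(len(arms)), key=draws.__getitem__)


def choose_ur(arms, rng):
    """Uniform random allocation: every arm is equally likely."""
    return rng.randrange(len(arms))


def run_experiment(true_open_rates, n_students, ur_fraction, seed=0):
    """Simulate one experiment and return (assignment counts, arms).

    ur_fraction is an assumed mixing parameter, not taken from the paper.
    Rewards from both UR and TS assignments update the shared posterior.
    """
    rng = random.Random(seed)
    arms = [BetaBernoulliArm() for _ in true_open_rates]
    counts = [0] * len(arms)
    for _ in range(n_students):
        if rng.random() < ur_fraction:
            k = choose_ur(arms, rng)
        else:
            k = choose_ts(arms, rng)
        opened = rng.random() < true_open_rates[k]  # simulated student response
        arms[k].update(opened)
        counts[k] += 1
    return counts, arms
```

For example, `run_experiment([0.30, 0.35], n_students=500, ur_fraction=0.5)` simulates two subject lines with made-up open rates. Comparing the returned counts across several seeds and values of `ur_fraction` illustrates the trade-off described above: less uniform exploration sends more students to the arm that currently looks better, but leaves fewer observations for judging whether the arms really differ.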
