Machine Learning (ML) increasingly informs the allocation of opportunities to individuals and communities in areas such as lending, education, employment, and beyond. Such decisions often impact their subjects' future characteristics and capabilities in an a priori unknown fashion. The decision-maker, therefore, faces exploration-exploitation dilemmas akin to those in multi-armed bandits. Following prior work, we model communities as arms. To capture the long-term effects of ML-based allocation decisions, we study a setting in which the reward from each arm evolves every time the decision-maker pulls that arm. We focus on reward functions that are initially increasing in the number of pulls but may become (and remain) decreasing after a certain point. We argue that an acceptable sequential allocation of opportunities must take an arm's potential for growth into account. We capture these considerations through the notion of policy regret, a much stronger notion than the often-studied external regret, and present an algorithm with provably sub-linear policy regret for sufficiently long time horizons. We empirically compare our algorithm with several baselines and find that it consistently outperforms them, in particular for long time horizons.
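As a pointer for readers, the contrast between policy regret and external regret invoked above can be sketched in its commonly used form as follows; the reward functions $f_t$, pulled arms $x_t$, and arm count $K$ are notation introduced here for illustration only, and the benchmark class the paper actually competes against may differ from the fixed-action benchmark shown:

\[
  \mathrm{Regret}_{\mathrm{policy}}(T)
    = \max_{a \in [K]} \sum_{t=1}^{T} f_t(a, \dots, a)
      \;-\; \sum_{t=1}^{T} f_t(x_1, \dots, x_t),
\]
\[
  \mathrm{Regret}_{\mathrm{external}}(T)
    = \max_{a \in [K]} \sum_{t=1}^{T} f_t(x_1, \dots, x_{t-1}, a)
      \;-\; \sum_{t=1}^{T} f_t(x_1, \dots, x_t),
\]

where $x_1, \dots, x_T$ are the arms pulled by the learner and $f_t$ gives the round-$t$ reward as a function of the entire pull history. Policy regret thus charges the benchmark with the counterfactual evolution its own pulls would have induced, which is what makes it the appropriate yardstick in a setting where pulling an arm changes that arm's future rewards.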