Learning-to-defer is a framework for automatically deferring decision-making to a human expert when ML-based decisions are deemed unreliable. Existing learning-to-defer frameworks are not designed for sequential settings: they defer at every instance independently, based on immediate predictions, while ignoring the potential long-term impact of these interventions. As a result, existing frameworks are myopic. Further, they do not defer adaptively, which is crucial when human interventions are costly. In this work, we propose Sequential Learning-to-Defer (SLTD), a framework for learning to defer to a domain expert in sequential decision-making settings. Contrary to existing literature, we pose learning-to-defer as a model-based reinforcement learning (RL) problem, to i) account for the long-term consequences of ML-based actions (via RL) and ii) defer adaptively based on the learned dynamics (via the model). Our proposed framework determines whether to defer at each time step by quantifying whether deferring now would improve the value compared to delaying the deferral to the next time step. To quantify this improvement, we account for potential future deferrals. As a result, we learn a pre-emptive deferral policy, i.e. a policy that defers early if continuing with the ML-based policy could worsen long-term outcomes. Our deferral policy is adaptive to non-stationarity in the dynamics. We demonstrate that adaptive deferral via SLTD provides an improved trade-off between long-term outcomes and deferral frequency on synthetic, semi-synthetic, and real-world data with non-stationary dynamics. Finally, we justify each deferral decision by decomposing the propagated (long-term) uncertainty around the outcome.
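To make the deferral criterion concrete, the following is a minimal, self-contained Python sketch of a pre-emptive deferral rule on a toy two-state problem. All names, dynamics, values, and costs here are illustrative assumptions for exposition, not the paper's actual implementation or API; the sketch only shows the comparison between deferring now and acting with the ML policy while keeping the option to defer at the next step.

```python
import numpy as np

gamma = 0.95          # discount factor (assumed)
deferral_cost = 0.1   # per-step cost of querying the expert (assumed)

# Toy value estimates (assumed): expected long-term return from each state
# under the ML policy and under the human expert.
v_ml     = np.array([1.0, 0.2])   # ML-policy value in states 0 and 1
v_expert = np.array([1.1, 0.9])   # expert value in states 0 and 1

# Toy learned dynamics and reward under the ML action: P[s' | s, a_ML], r(s, a_ML)
P_ml = np.array([[0.7, 0.3],
                 [0.1, 0.9]])
r_ml = np.array([0.5, 0.0])

def should_defer(s: int) -> bool:
    """Defer now iff handing control to the expert improves expected value
    over taking the ML action and re-evaluating deferral at the next step."""
    value_defer_now = v_expert[s] - deferral_cost
    # Value of continuing with the ML action, accounting for the option to
    # defer at the next time step (potential future deferrals).
    value_next = np.maximum(v_expert - deferral_cost, v_ml)
    value_continue = r_ml[s] + gamma * P_ml[s] @ value_next
    return value_defer_now > value_continue

for s in (0, 1):
    print(f"state {s}: defer = {should_defer(s)}")
```

Under these toy numbers, the rule continues with the ML policy in the well-modelled state and defers pre-emptively in the state where the ML policy's long-term value is poor, which mirrors the trade-off between long-term outcomes and deferral frequency described above.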