We consider an online learning problem with one-sided feedback, in which the learner is able to observe the true label only for positively predicted instances. On each round, $k$ instances arrive and receive classification outcomes according to a randomized policy deployed by the learner, whose goal is to maximize accuracy while deploying individually fair policies. We first extend the framework of Bechavod et al. (2020), which relies on the existence of a human fairness auditor for detecting fairness violations, to instead incorporate feedback from dynamically-selected panels of multiple, possibly inconsistent, auditors. We then construct an efficient reduction from our problem of online learning with one-sided feedback and a panel reporting fairness violations to the contextual combinatorial semi-bandit problem (Cesa-Bianchi & Lugosi, 2009; Gy\"{o}rgy et al., 2007). Finally, we show how to leverage the guarantees of two algorithms in the contextual combinatorial semi-bandit setting, Exp2 (Bubeck et al., 2012) and the oracle-efficient Context-Semi-Bandit-FTPL (Syrgkanis et al., 2016), to provide multi-criteria no-regret guarantees simultaneously for accuracy and fairness. Our results eliminate two potential sources of bias from prior work: the "hidden outcomes" that are not available to an algorithm operating in the full information setting, and human biases that might be present in any single human auditor but can be mitigated by selecting a well-chosen panel.
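To make the interaction protocol concrete, the following is a minimal Python sketch of one round under one-sided feedback, where the true label is revealed only for positively predicted instances. The `RandomizedPolicy` class and `run_round` function are illustrative stand-ins for the learner's deployed policy, not the paper's algorithm; the fairness-auditing panel and the semi-bandit reduction are omitted.

```python
import random

class RandomizedPolicy:
    """Toy randomized policy: predicts positive with a fixed probability.
    Hypothetical stand-in for the learner's deployed policy."""
    def __init__(self, p_positive=0.5):
        self.p_positive = p_positive

    def predict(self, x):
        # Randomized binary prediction in {0, 1}.
        return 1 if random.random() < self.p_positive else 0

def run_round(policy, instances, true_labels):
    """One round of the protocol: k instances arrive, each receives a
    randomized prediction, and the true label is observed only when
    the prediction is positive (one-sided feedback)."""
    feedback = []
    for x, y in zip(instances, true_labels):
        y_hat = policy.predict(x)
        observed = y if y_hat == 1 else None  # label stays hidden for negatives
        feedback.append((x, y_hat, observed))
    return feedback

if __name__ == "__main__":
    # Example: one round with k = 4 instances and hidden binary labels.
    policy = RandomizedPolicy()
    print(run_round(policy,
                    instances=[0.2, 0.7, 0.1, 0.9],
                    true_labels=[0, 1, 0, 1]))
```

The `None` entries are the "hidden outcomes" referred to above: under a full-information baseline every `y` would be visible, whereas here the learner must cope with labels censored by its own predictions.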