We consider the contextual bandit problem on general action and context spaces, where the learner's rewards depend on their selected actions and an observable context. This generalizes the standard multi-armed bandit to the case where side information is available, e.g., patients' records or customers' history, which allows for personalized treatment. We focus on consistency -- vanishing regret compared to the optimal policy -- and show that for large classes of non-i.i.d. contexts, consistency can be achieved regardless of the time-invariant reward mechanism, a property known as universal consistency. Precisely, we first give necessary and sufficient conditions on the context-generating process for universal consistency to be possible. Second, we show that there always exists an algorithm that guarantees universal consistency whenever this is achievable, called an optimistically universal learning rule. Interestingly, for finite action spaces, learnable processes for universal learning are exactly the same as in the full-feedback setting of supervised learning, previously studied in the literature. In other words, learning can be performed with partial feedback without any generalization cost. The algorithms balance a trade-off between generalization (similar to structural risk minimization) and personalization (tailoring actions to specific contexts). Lastly, we consider the case of added continuity assumptions on rewards and show that these lead to universal consistency for significantly larger classes of data-generating processes.
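For concreteness, one common way to formalize the consistency criterion described above is the following minimal sketch; the notation ($\mathcal{X}$, $\mathcal{A}$, $r$, $\pi^\star$) is ours and not taken from the abstract, and we simplify to deterministic rewards. At each step $t$ the learner observes a context $x_t \in \mathcal{X}$, selects an action $a_t \in \mathcal{A}$, and receives reward $r(a_t, x_t)$ for a fixed time-invariant reward function $r$. Regret is then measured against the best fixed measurable policy $\pi^\star : \mathcal{X} \to \mathcal{A}$:
\[
  R_T \;=\; \sum_{t=1}^{T} \bigl( r(\pi^\star(x_t), x_t) - r(a_t, x_t) \bigr).
\]
Under this convention, consistency for a given context process $(x_t)_{t \ge 1}$ means $\limsup_{T \to \infty} R_T / T \le 0$ almost surely, and universal consistency asks that this hold simultaneously for every time-invariant reward mechanism $r$.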