When an Agent visits a platform recommending a menu of content to select from, their choice of item depends not only on fixed preferences, but also on their prior engagements with the platform. The Recommender's primary objective is typically to encourage content consumption which optimizes some reward, such as ad revenue, but they often also aim to ensure that a wide variety of content is consumed by the Agent over time. We formalize this problem as an adversarial bandit task. At each step, the Recommender presents a menu of $k$ (out of $n$) items to the Agent, who selects one item in the menu according to their unknown preference model, which maps their history of past items to relative selection probabilities. The Recommender then observes the Agent's chosen item and receives bandit feedback of the item's reward. In addition to optimizing reward from selected items, the Recommender must also ensure that the total distribution of chosen items has sufficiently high entropy. We define a class of preference models which are locally learnable, i.e. behavior over the entire domain can be estimated by only observing behavior in a small region; this includes models representable by bounded-degree polynomials as well as functions with a sparse Fourier basis. For this class, we give an algorithm for the Recommender which obtains $\tilde{O}(T^{3/4})$ regret against all item distributions satisfying two conditions: they are sufficiently diversified, and they are instantaneously realizable at any history by some distribution over menus. We show that these conditions are closely connected: all sufficiently high-entropy distributions are instantaneously realizable at any item history. We also give a set of negative results justifying our assumptions, in the form of a runtime lower bound for non-local learning and linear regret lower bounds for alternate benchmarks.
翻译:当一个代理访问一个平台, 建议内容菜单从中选择时, 他们的项目选择不仅取决于固定的偏好, 也取决于他们之前与平台的接触。 推荐人的主要目的通常是鼓励内容消费, 优化某些奖赏, 如广告收入等, 但通常还旨在确保代理人长期消费各种各样的内容。 我们将此问题正式化为对抗性土匪任务。 在每一个步骤中, 推荐人向代理人展示一个菜单, 菜单中包括$( 用美元来选择) 的项目, 他们不仅选择固定的偏好, 而且还取决于他们先前与平台的交往。 推荐人的主要目的通常是鼓励内容消费消费, 这通常是为了鼓励内容消费, 例如广告中选择的项目, 但是除了优化某些物品的奖赏之外, 推荐人还必须确保所选项目的总分布足够高的金质。 我们为本地学习的任何优惠模式, 也就是说, 连接整个域的行为只能通过观察一个小区域中的不同动作来估计, 将过去项目的历史历史历史历史历史历史历史历史历史历史记录 和项目反馈反馈 。 这个模型在四度上可以代表一个固定的磁度矩阵 。