Recommender systems (RS) mediate human experience online. Most RS optimize metrics that are imperfectly aligned with users' best interests but are easy to measure, such as ad clicks and user engagement. This has produced a host of hard-to-measure side effects: political polarization, addiction, fake news. RS design faces a recommender alignment problem: that of aligning recommendations with the goals of users, system designers, and society as a whole. But how do we test and compare potential solutions for aligning RS? Their massive scale makes them costly and risky to test in deployment. We synthesized a simple abstract modelling framework to guide future work. To illustrate it, we construct a toy experiment in which we ask: "How can we evaluate the consequences of using user retention as a reward function?" To answer this question, we learn recommender policies that optimize reward functions by controlling graph dynamics in a toy environment. Based on the effects that trained recommenders have on their environment, we conclude that engagement maximizers generally, but not always, lead to worse outcomes than aligned recommenders. After learning, we examine competition between RS as a potential solution to recommender alignment. We find that it generally leaves our toy society better off than it would be with no recommendation or with engagement maximizers. In this work, we aimed for broad scope, touching superficially on many different points to shed light on how an end-to-end study of reward functions for recommender systems might be done. Recommender alignment is a pressing and important problem, and attempted solutions are sure to have far-reaching impacts. Here, we take a first step in developing methods for evaluating and comparing solutions with respect to their impacts on society.