This paper provides novel formal methods and empirical demonstrations of a particular safety concern in reinforcement learning (RL)-based recommendation algorithms. We call this safety concern `user tampering' -- a phenomenon whereby an RL-based recommender system might manipulate a media user's opinions via its recommendations as part of a policy to increase long-term user engagement. We then apply techniques from causal modelling to analyse the leading approaches in the literature for implementing scalable RL-based recommenders, and we observe that the current approaches permit user tampering. Additionally, we review the existing mitigation strategies for reward tampering problems and show that they do not transfer well to the user tampering phenomenon found in the recommendation context. Furthermore, we provide a simulation study of a media RL-based recommendation problem constrained to the recommendation of political content. We show that a Q-learning algorithm consistently learns to exploit its opportunities to polarise simulated users with its early recommendations in order to have more consistent success with later recommendations catering to that polarisation. This latter contribution calls for urgency in designing safer RL-based recommenders; the former suggests that creating such safe recommenders will require a fundamental shift in design away from the approaches we have seen in the recent literature.
翻译:本文提供了新颖的正式方法和在强化学习(RL)基于建议算法中特别安全关切的经验示范。我们称这种安全关切为“用户篡改”现象,即基于RL的建议系统可能通过其建议操纵媒体用户的意见,以此作为提高长期用户参与程度的政策的一部分。然后,我们运用因果建模技术分析文献中的主要方法,以实施基于RL的可扩缩建议,我们发现目前的方法允许用户篡改。此外,我们审查现有的奖励篡改问题缓解战略,并表明这些战略没有很好地转移给在建议中发现的用户篡改现象。此外,我们提供媒体RL建议系统模拟研究,但限于政治内容的建议。我们表明,基于Q学习的算法不断学会利用其机会,用早期的建议对模拟用户进行极化,以便更一致地成功地落实后来提出的有关极化的建议。后一种方法要求紧迫地设计更安全的基于RL的建议;前一种方法表明,创建这种安全建议者需要从我们所看到的文献中彻底改变设计方法。