This paper provides the first formalisation and empirical demonstration of a particular safety concern in reinforcement learning (RL)-based news and social media recommendation algorithms. This safety concern is what we call "user tampering" -- a phenomenon whereby an RL-based recommender system may manipulate a media user's opinions, preferences and beliefs via its recommendations as part of a policy to increase long-term user engagement. We provide a simulation study of a media recommendation problem constrained to the recommendation of political content, and demonstrate that a Q-learning algorithm consistently learns to exploit its opportunities to 'polarise' simulated 'users' with its early recommendations in order to have more consistent success with later recommendations catering to that polarisation. Finally, we argue that given our findings, designing an RL-based recommender system which cannot learn to exploit user tampering requires making the metric for the recommender's success independent of observable signals of user engagement, and thus that a media recommendation system built solely with RL is necessarily either unsafe, or almost certainly commercially unviable.
翻译:本文首次正式和实证地展示了强化学习(RL)基于新闻和社会媒体建议算法中特别的安全关切。 这种安全关切是我们所谓的“用户篡改”现象,即基于RL的建议系统可以通过其建议来操纵媒体用户的意见、偏好和信仰,以此作为增加长期用户参与的政策的一部分。我们模拟研究了受政治内容建议限制的媒体建议问题,并表明一个Q-学习算法一贯学习如何利用它的机会,利用它早期的建议“Polarise”模拟用户,以便更一致地成功落实后来关于两极化的建议。 最后,我们指出,根据我们的调查结果,设计基于RL的建议系统,无法利用用户篡改的系统,需要根据用户参与的可观察信号,对建议者的成功进行衡量,因此,仅与RL建立的媒体建议系统必然不安全,或几乎在商业上是行不通的。