Learning the Optimal Recommendation from Explorative Users

We propose a new problem setting to study the sequential interactions between a recommender system and a user. Instead of assuming that the user is omniscient, static, and explicit, as is classical practice, we sketch a more realistic user behavior model, under which the user: 1) rejects recommendations if they are clearly worse than others; 2) updates her utility estimates based on rewards from the recommendations she accepts; 3) withholds realized rewards from the system. We formulate the interactions between the system and such an explorative user in a $K$-armed bandit framework and study the problem of learning the optimal recommendation on the system side. We show that efficient system learning is still possible but more difficult. In particular, the system can identify the best arm with probability at least $1-\delta$ within $O(1/\delta)$ interactions, and we prove that this bound is tight. Our finding contrasts with the result for best arm identification with fixed confidence, in which the best arm can be identified with probability $1-\delta$ within $O(\log(1/\delta))$ interactions. This gap illustrates the inevitable cost the system has to pay when it learns from an explorative user's revealed preferences on its recommendations rather than from the realized rewards.
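
The three user behaviors listed in the abstract can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the paper's model: the class name `ExplorativeUser`, the Bernoulli rewards, the confidence radius $\sqrt{2\log(t+1)/n}$, and the `radius_scale` parameter are all hypothetical choices made here to give "clearly worse" a concrete meaning.

```python
import math
import random


class ExplorativeUser:
    """Hypothetical explorative user for a K-armed bandit (illustrative only)."""

    def __init__(self, true_means, radius_scale=1.0, seed=0):
        self.true_means = list(true_means)   # unknown to both user and system
        self.k = len(self.true_means)
        self.counts = [0] * self.k           # accepted recommendations per arm
        self.means = [0.0] * self.k          # user's empirical utility estimates
        self.radius_scale = radius_scale     # assumed confidence-width scaling
        self.rng = random.Random(seed)
        self.t = 0                           # number of recommendations seen

    def _radius(self, arm):
        # Assumed confidence radius; an untried arm is never ruled out.
        if self.counts[arm] == 0:
            return math.inf
        return self.radius_scale * math.sqrt(
            2.0 * math.log(self.t + 1) / self.counts[arm]
        )

    def respond(self, arm):
        """Return True if the recommendation is accepted, False if rejected."""
        self.t += 1
        upper = self.means[arm] + self._radius(arm)
        best_lower = max(self.means[j] - self._radius(j) for j in range(self.k))
        # 1) Reject an arm that is clearly worse than some other arm.
        if upper < best_lower:
            return False
        # 2) On acceptance, realize a reward and update the utility estimate.
        reward = 1.0 if self.rng.random() < self.true_means[arm] else 0.0
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
        # 3) The realized reward is withheld; only the decision is revealed.
        return True
```

In this sketch the system's only feedback is the boolean returned by `respond`, for example `ExplorativeUser([0.9, 0.1]).respond(0)`, which reflects the revealed-preference setting the abstract describes.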
