Hierarchical Conversational Preference Elicitation with Bandit Feedback

Recent advances in conversational recommendation provide a promising way to efficiently elicit users' preferences via conversational interactions. To achieve this, the recommender system conducts conversations with users, asking about their preferences for different items or item categories. Most existing conversational recommender systems for cold-start users utilize a multi-armed bandit framework to learn users' preferences in an online manner. However, they rely on a pre-defined conversation frequency for asking about item categories (key-terms) instead of individual items, which may incur excessive conversational interactions that hurt user experience. To enable more flexible questioning about key-terms, we formulate a new conversational bandit problem that allows the recommender system to choose either a key-term to ask about or an item to recommend at each round, and that explicitly models the rewards of both kinds of action. This introduces a new exploration-exploitation (EE) trade-off between key-term asking and item recommendation, which requires us to accurately model the relationship between key-term and item rewards. We conduct a survey and analyze a real-world dataset, and find that, unlike the assumptions made in prior work, key-term rewards are mainly affected by the rewards of representative items. We propose two bandit algorithms, Hier-UCB and Hier-LinUCB, that leverage this observed relationship and the hierarchical structure between key-terms and items to efficiently learn which items to recommend. We prove that our algorithms reduce the regret bound's dependence on the total number of items relative to previous work. We validate the proposed algorithms and regret bound on both synthetic and real-world data.
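
To make the ask-or-recommend decision concrete, the following is a minimal Python sketch of a generic two-level UCB policy. It is not the paper's Hier-UCB or Hier-LinUCB, whose details the abstract does not give. The class name `TwoLevelUCB`, the `items_by_term` layout, the `explore_width` threshold, and the rule "ask about the key-term while its confidence width is large, otherwise recommend an item under it" are all assumptions made for illustration.

```python
import math


class TwoLevelUCB:
    """Illustrative two-level UCB1 policy (hypothetical; not the paper's Hier-UCB)."""

    def __init__(self, items_by_term, explore_width=0.5):
        # items_by_term: dict mapping a key-term to the list of its item ids
        # (an assumed data layout). explore_width is an assumed threshold.
        self.items_by_term = items_by_term
        self.explore_width = explore_width
        self.counts = {}  # number of times each action was played
        self.means = {}   # running mean reward of each action
        for term, items in items_by_term.items():
            for action in [("term", term)] + [("item", i) for i in items]:
                self.counts[action] = 0
                self.means[action] = 0.0
        self.t = 0  # current round

    def _width(self, action):
        # UCB1 confidence width; infinite for actions never played.
        n = self.counts[action]
        if n == 0:
            return float("inf")
        return math.sqrt(2.0 * math.log(self.t) / n)

    def _ucb(self, action):
        return self.means[action] + self._width(action)

    def select(self):
        """Return ("term", k) to ask about key-term k, or ("item", i) to recommend item i."""
        self.t += 1
        # Upper level: key-term with the highest UCB index.
        best_term = max(self.items_by_term, key=lambda k: self._ucb(("term", k)))
        term_action = ("term", best_term)
        # Assumed rule: keep asking while the key-term estimate is still uncertain.
        if self._width(term_action) > self.explore_width:
            return term_action
        # Lower level: item under that key-term with the highest UCB index.
        candidates = [("item", i) for i in self.items_by_term[best_term]]
        return max(candidates, key=self._ucb)

    def update(self, action, reward):
        # Incremental mean update for the action that was just played.
        self.counts[action] += 1
        n = self.counts[action]
        self.means[action] += (reward - self.means[action]) / n


def simulate(policy, reward_fn, rounds):
    """Run the policy for a number of rounds; reward_fn(action) supplies the feedback."""
    for _ in range(rounds):
        action = policy.select()
        policy.update(action, reward_fn(action))
    return policy
```

In this sketch, key-term questions are spent only while the key-term estimate is uncertain, so their number grows roughly logarithmically with the number of rounds instead of following a fixed conversation frequency. Feedback on items and key-terms is tracked separately here; the paper's observation that key-term rewards are driven by representative items is not modeled.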
