Recent advances in reinforcement learning have inspired growing interest in modeling users adaptively through dynamic interactions, for example in reinforcement-learning-based recommender systems. The reward function is crucial for most reinforcement learning applications, as it guides the optimization. However, current reinforcement-learning-based methods rely on manually defined reward functions, which cannot adapt to dynamic and noisy environments. Moreover, they generally use task-specific reward functions that sacrifice generalization ability. To address these issues, we propose a generative inverse reinforcement learning approach for user behavioral preference modeling. Instead of relying on a predefined reward function, our model automatically learns rewards from the user's actions using a discriminative actor-critic network and a Wasserstein GAN. Our model provides a general way of characterizing and explaining underlying behavioral tendencies, and our experiments show that it outperforms state-of-the-art methods in a variety of scenarios, namely traffic signal control, online recommender systems, and scanpath prediction.
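To make the core idea concrete, the sketch below illustrates one plausible realization of the reward-learning component described above: a Wasserstein-GAN-style critic scoring (state, action) pairs, whose output is then used as the learned reward for an actor-critic policy. This is a minimal illustration only, not the authors' implementation; all names (RewardDiscriminator, wgan_critic_loss, learned_reward), network sizes, and hyperparameters are assumptions.

```python
# Illustrative sketch (assumed, not the paper's code): a WGAN-style critic over
# (state, action) pairs whose score serves as a learned reward signal.

import torch
import torch.nn as nn

class RewardDiscriminator(nn.Module):
    """Critic scoring (state, action) pairs; higher score = more expert-like."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # unbounded score (no sigmoid, as in WGAN)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def wgan_critic_loss(disc, expert_s, expert_a, policy_s, policy_a, gp_weight=10.0):
    """Wasserstein critic loss with a gradient penalty enforcing 1-Lipschitzness."""
    expert_score = disc(expert_s, expert_a).mean()
    policy_score = disc(policy_s, policy_a).mean()

    # Gradient penalty on interpolations between expert and policy samples.
    eps = torch.rand(expert_s.size(0), 1)
    inter_s = (eps * expert_s + (1 - eps) * policy_s).requires_grad_(True)
    inter_a = (eps * expert_a + (1 - eps) * policy_a).requires_grad_(True)
    score = disc(inter_s, inter_a)
    grads = torch.autograd.grad(score.sum(), [inter_s, inter_a], create_graph=True)
    grad_norm = torch.cat([g.flatten(1) for g in grads], dim=1).norm(2, dim=1)
    gp = ((grad_norm - 1.0) ** 2).mean()

    # The critic maximizes expert_score - policy_score; we minimize its negative.
    return -(expert_score - policy_score) + gp_weight * gp

def learned_reward(disc, state, action):
    """Discriminator score used as the reward fed to the actor-critic learner."""
    with torch.no_grad():
        return disc(state, action)
```

In a full adversarial inverse reinforcement learning loop of this kind, one would alternate between updating the critic on batches of observed user (expert) trajectories versus trajectories generated by the current policy, and updating the policy with an actor-critic method that treats learned_reward as the reward signal; the details of that loop are specific to the paper and not reproduced here.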