The mainstream approach in position allocation systems is to use a reinforcement learning (RL) model to assign appropriate positions to items from various channels and then mix them into the feed. Two types of data are used to train RL models for position allocation: strategy data and random data. Strategy data is collected from the current online model and suffers from an imbalanced distribution of state-action pairs, which causes severe overestimation during training. Random data offers a more uniform distribution of state-action pairs but is hard to obtain in industrial settings, because random exploration can hurt platform revenue and user experience. Because the two types of data follow different distributions, designing a strategy that leverages both to improve RL model training is highly challenging. In this study, we propose a framework named Multi-Distribution Data Learning (MDDL) for training RL models on mixed multi-distribution data. Specifically, MDDL incorporates a novel imitation learning signal to mitigate overestimation on strategy data and maximizes the RL signal on random data to facilitate effective learning. In our experiments, we evaluate MDDL in a real-world position allocation system and show that it outperforms the previous baseline. MDDL has been fully deployed on the Meituan food delivery platform and currently serves over 300 million users.
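The idea of treating the two data sources differently can be illustrated with a minimal sketch. The abstract does not give MDDL's actual loss, so everything below is an assumption made for illustration: the `Transition` fields, the squared TD error as the RL signal, a softmax cross-entropy toward the logged action as the imitation signal, and the weights `rl_weight_strategy` and `il_weight` are all hypothetical, not the authors' formulation.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Transition:
    """One logged sample (hypothetical layout, not from the paper)."""
    q_values: List[float]  # model's Q(s, a) for each candidate position
    action: int            # index of the logged action
    td_target: float       # precomputed bootstrapped target for Q(s, action)
    is_random: bool        # True if collected under random exploration


def log_softmax(xs: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of scores."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def td_loss(t: Transition) -> float:
    """RL signal: squared temporal-difference error on the logged action."""
    return (t.q_values[t.action] - t.td_target) ** 2


def imitation_loss(t: Transition) -> float:
    """Imitation signal (assumed form): cross-entropy toward the logged action."""
    return -log_softmax(t.q_values)[t.action]


def mixed_loss(batch: List[Transition],
               rl_weight_strategy: float = 0.5,
               il_weight: float = 1.0) -> float:
    """Average loss over a batch that mixes strategy and random data.

    Random data keeps the full RL signal; strategy data gets a
    down-weighted RL signal plus an imitation term (assumed weighting).
    """
    total = 0.0
    for t in batch:
        if t.is_random:
            total += td_loss(t)
        else:
            total += rl_weight_strategy * td_loss(t) + il_weight * imitation_loss(t)
    return total / len(batch)


if __name__ == "__main__":
    batch = [
        Transition(q_values=[1.0, 2.0], action=1, td_target=2.0, is_random=True),
        Transition(q_values=[0.0, 0.0], action=0, td_target=0.0, is_random=False),
    ]
    print(mixed_loss(batch))
```

In this sketch the per-sample source flag decides which signals apply, so a single batch can contain both distributions. How MDDL actually constructs and weights its imitation signal is not stated in the abstract.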