Learning user sequence behaviour embeddings is challenging due to complex feature interactions over time and the high dimensionality of user features. Recent foundation models, e.g., BERT and its variants, have encouraged a large body of researchers to investigate this field. However, unlike natural language processing (NLP) tasks, the parameters of a user behaviour model come mostly from the user embedding layer, which prevents most existing works from training a universal user embedding at large scale. Furthermore, user representations are learned from multiple downstream tasks, and past research does not address the seesaw phenomenon. In this paper, we propose SUPERMOE, a generic framework for obtaining high-quality user representations from multiple tasks. Specifically, user behaviour sequences are encoded by an MoE transformer, which allows us to scale the model capacity to billions, or even trillions, of parameters. To deal with the seesaw phenomenon when learning across multiple tasks, we design a new loss function with task indicators. We perform extensive offline experiments on public datasets and online experiments on private real-world business scenarios. Our approach outperforms state-of-the-art models, and the results demonstrate the effectiveness of our framework.
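As a rough illustration of the two ideas named above, the following is a minimal sketch (not the paper's implementation) of (1) an MoE feed-forward block that scales capacity by routing each token to a single expert, and (2) a multi-task loss that uses per-sample task indicators so that each task contributes equally and one task does not dominate the others. The module names, the top-1 routing rule, and the indicator-masking scheme are assumptions for illustration only.

```python
# Hypothetical sketch of an MoE feed-forward block and a task-indicator loss.
# These are assumptions for illustration, not the SUPERMOE implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Token-wise top-1 mixture-of-experts feed-forward block."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten tokens for routing
        tokens = x.reshape(-1, x.size(-1))
        gate_probs = F.softmax(self.gate(tokens), dim=-1)   # (tokens, experts)
        top_prob, top_idx = gate_probs.max(dim=-1)          # top-1 routing
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # scale each routed token's output by its gate probability
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)


def multitask_loss(logits: torch.Tensor, labels: torch.Tensor,
                   task_ids: torch.Tensor, num_tasks: int) -> torch.Tensor:
    """Average per-task losses using task indicators (one possible scheme)."""
    per_sample = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    task_losses = []
    for t in range(num_tasks):
        mask = task_ids == t                                 # indicator for task t
        if mask.any():
            task_losses.append(per_sample[mask].mean())
    return torch.stack(task_losses).mean()
```

Adding more experts increases the parameter count while keeping the per-token compute roughly constant, since each token only passes through its routed expert; the per-task averaging in the loss is one simple way to keep a high-volume task from overwhelming the gradients of the others.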