Our goal is to build general representations (embeddings) for each user and each product item across Alibaba's businesses, including Taobao and Tmall, which are among the world's largest e-commerce websites. The representation of users and items plays a critical role in various downstream applications, including recommendation systems, search, marketing, demand forecasting, and so on. Inspired by the BERT model in the natural language processing (NLP) domain, we propose a GUIM (General User Item embedding with Mixture of representation) model to achieve this goal with massive, structured, multi-modal data covering the interactions among hundreds of millions of users and items. We utilize mixture of representation (MoR) as a novel representation form to model the diverse interests of each user. In addition, we use the InfoNCE loss from contrastive learning to avoid the intractable computational cost caused by the enormous size of the item (token) vocabulary. Finally, we propose a set of representative downstream tasks to serve as a standard benchmark for evaluating the quality of the learned user and/or item embeddings, analogous to the GLUE benchmark in the NLP domain. Our experimental results on these downstream tasks clearly demonstrate the comparative value of the embeddings learned by our GUIM model.
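To make the contrastive-learning point concrete, a minimal sketch of an InfoNCE-style objective is given below; the notation and exact form used in GUIM are assumptions here rather than taken from the paper. A user representation $u$ is contrasted against its positive item $i^{+}$ and a small sampled negative set $\mathcal{N}$, so the normalizing sum runs over $|\mathcal{N}|+1$ items instead of the full item vocabulary:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\big(\mathrm{sim}(u, i^{+})/\tau\big)}{\exp\big(\mathrm{sim}(u, i^{+})/\tau\big) + \sum_{j \in \mathcal{N}} \exp\big(\mathrm{sim}(u, i_{j})/\tau\big)},$$

where $\mathrm{sim}(\cdot,\cdot)$ is a similarity score (e.g., a dot product) and $\tau$ is a temperature hyperparameter. Because the denominator only requires the sampled negatives, the cost per training example is independent of the total number of items.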