Self-supervised Learning for Large-scale Item Recommendations

Large-scale recommender models find the most relevant items from huge catalogs, and they play a critical role in modern search and recommendation systems. To model the input space with large-vocabulary categorical features, a typical recommender model learns a joint embedding space through neural networks for both queries and items from user feedback data. However, with millions to billions of items in the corpus, users tend to provide feedback for only a very small set of them, resulting in a power-law distribution. This makes the feedback data for long-tail items extremely sparse. Inspired by the recent success of self-supervised representation learning research in both computer vision and natural language understanding, we propose a multi-task self-supervised learning (SSL) framework for large-scale item recommendations. The framework is designed to tackle the label sparsity problem by learning better latent relationships among item features. Specifically, SSL improves item representation learning and also serves as additional regularization to improve generalization. Furthermore, we propose a novel data augmentation method that utilizes feature correlations within the proposed framework. We evaluate our framework using two real-world datasets with 500M and 1B training examples respectively. Our results demonstrate the effectiveness of SSL regularization and show its superior performance over the state-of-the-art regularization techniques. We have also launched the proposed techniques in a web-scale commercial app-to-app recommendation system, with significant improvements in top-tier business metrics demonstrated in A/B experiments on live traffic. Our online results also verify our hypothesis that our framework improves model performance even more on slices that lack supervision.
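The abstract does not spell out the SSL objective or the augmentation, so the following is only an illustrative sketch of one common way such a framework can be set up. It assumes a contrastive auxiliary loss between two augmented views of the same item, with the views produced by splitting the item's categorical features into two complementary sets. The random split is a stand-in for the correlation-aware augmentation the abstract mentions, and the names `complementary_masks`, `encode`, `contrastive_loss`, `total_loss`, `temperature`, and `alpha` are all hypothetical.

```python
import numpy as np

def complementary_masks(num_features, rng):
    """Split feature indices into two disjoint halves.

    Assumption: a random split is used here purely for illustration; the
    abstract's augmentation is said to exploit feature correlations instead.
    """
    perm = rng.permutation(num_features)
    keep_a = np.zeros(num_features, dtype=bool)
    keep_a[perm[: num_features // 2]] = True
    return keep_a, ~keep_a

def encode(feature_ids, keep, tables):
    """Toy item tower: sum embeddings of the kept features, then L2-normalize."""
    batch, num_features = feature_ids.shape
    dim = tables[0].shape[1]
    out = np.zeros((batch, dim))
    for f in range(num_features):
        if keep[f]:
            out += tables[f][feature_ids[:, f]]
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-12)

def contrastive_loss(z_a, z_b, temperature=0.1):
    """Batch-softmax contrastive loss: row i of z_a should match row i of z_b."""
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def total_loss(main_loss, ssl_loss, alpha=0.1):
    """Multi-task objective: supervised loss plus a weighted SSL regularizer."""
    return main_loss + alpha * ssl_loss

rng = np.random.default_rng(0)
num_features, vocab, dim, batch = 6, 50, 8, 4
# One embedding table per categorical feature (randomly initialized).
tables = [rng.normal(size=(vocab, dim)) for _ in range(num_features)]
items = rng.integers(0, vocab, size=(batch, num_features))

keep_a, keep_b = complementary_masks(num_features, rng)
ssl = contrastive_loss(encode(items, keep_a, tables),
                       encode(items, keep_b, tables))
```

In this sketch the SSL term needs only item features, not user feedback, which is why an auxiliary task of this kind can in principle help long-tail items that have few labels. The weight `alpha` controls how strongly it regularizes the main objective.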
