Large language models (LLMs) have shown potential in recommendation systems (RecSys), where they are used as either knowledge enhancers or zero-shot rankers. A key challenge is the large semantic gap between the two: LLMs internalize world knowledge expressed in language, whereas RecSys capture a personalized world of user behaviors. Unfortunately, the research community lacks a comprehensive benchmark that evaluates the limitations and boundaries of LLMs in RecSys, which makes it hard to draw confident conclusions. To address this, we propose LRWorld, a benchmark containing over 38K high-quality samples and 23M tokens, carefully compiled and generated from widely used public recommendation datasets. LRWorld organizes the mental world of LLMs in RecSys into three main scales (association, personalization, and knowledgeability), spanned by ten factors with 31 measures (tasks). Comprehensive experiments on dozens of LLMs using LRWorld show that they still do not capture deep neural personalized embeddings well, but achieve good results on shallow memory-based item-item similarity. They are also good at perceiving item entity relations, entity hierarchical taxonomies, and item-item association rules when inferring user interests. Furthermore, LLMs show promising ability in multimodal knowledge reasoning (movie posters and product images) and robustness to noisy profiles. No model performs consistently well across all ten factors. We also ablate model size, position bias, and other factors.