Despite the success of vision-language models in various generative tasks, obtaining high-quality semantic representations for products and user intents remains challenging because off-the-shelf models fail to capture the nuanced relationships between these entities. In this paper, we introduce a joint training framework for products and user queries that aligns uni-modal and multi-modal encoders through contrastive learning on image-text data. Our novel approach trains a query encoder with an LLM-curated relevance dataset, eliminating the reliance on engagement history. The resulting embeddings demonstrate strong generalization and improve performance across applications, including product categorization and relevance prediction. For personalized ads recommendation, a significant post-deployment uplift in click-through rate and conversion rate further confirms the impact on key business metrics. We believe that the flexibility of our framework makes it a promising solution for enriching the user experience across the e-commerce landscape.