BERT-style models, pre-trained on a general corpus (e.g., Wikipedia) and fine-tuned on a task-specific corpus, have recently emerged as breakthrough techniques for many NLP tasks: question answering, text classification, sequence labeling, and so on. However, this technique may not always work, especially in two scenarios: a corpus whose text differs substantially from the general corpus (Wikipedia), or a task that learns the spatial distribution of embeddings for a specific purpose (e.g., approximate nearest neighbor search). In this paper, to tackle these two scenarios, which we have encountered in an industrial e-commerce search system, we propose customized and novel pre-training tasks for two critical modules: user intent detection and semantic embedding retrieval. The customized pre-trained models after fine-tuning, kept at less than 10% of BERT-base's size to make cost-efficient CPU serving feasible, significantly outperform two baseline models: 1) a model without pre-training and 2) a model fine-tuned from the official BERT pre-trained on the general corpus, on both offline datasets and the online system. We have open-sourced our datasets for the sake of reproducibility and future work.
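To make the semantic embedding retrieval setup concrete, below is a minimal sketch of how a compact BERT-style encoder can produce query/document embeddings for nearest-neighbor lookup. The checkpoint name, pooling choice, and brute-force similarity search are illustrative assumptions, not the authors' released models or serving code.

```python
# Minimal sketch of semantic embedding retrieval with a compact BERT-style
# encoder. The checkpoint name is a placeholder for a small customized
# pre-trained model (<10% of BERT-base); it is not the paper's released model.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "prajjwal1/bert-tiny"  # placeholder compact encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
encoder.eval()

@torch.no_grad()
def embed(texts):
    """Mean-pool last hidden states into L2-normalized sentence vectors."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)         # masked mean pooling
    return torch.nn.functional.normalize(pooled, dim=-1).numpy()

# Index a toy product catalog and retrieve by cosine similarity
# (a brute-force stand-in for an approximate nearest neighbor index).
catalog = ["wireless bluetooth earbuds",
           "stainless steel water bottle",
           "usb-c fast charging cable"]
doc_vecs = embed(catalog)

query_vec = embed(["cable for charging my phone"])
scores = doc_vecs @ query_vec.T                           # cosine similarity
best = int(np.argmax(scores))
print(catalog[best], float(scores[best]))
```

In production, the brute-force dot product above would be replaced by an approximate nearest neighbor index, and the encoder would be the customized, fine-tuned compact model served on CPU.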