BERT-style models, pre-trained on a general corpus (e.g., Wikipedia) and fine-tuned on a task-specific corpus, have recently emerged as breakthrough techniques for many NLP tasks: question answering, text classification, sequence labeling, and so on. However, this technique may not always work, especially in two scenarios: a corpus whose text differs substantially from the general corpus (Wikipedia), or a task that learns the spatial distribution of embeddings for a specific purpose (e.g., approximate nearest neighbor search). In this paper, to tackle these two scenarios, which we have encountered in an industrial e-commerce search system, we propose customized and novel pre-training tasks for two critical modules: user intent detection and semantic embedding retrieval. The customized pre-trained models after fine-tuning, kept at less than 10% of BERT-base's size to make cost-efficient CPU serving feasible, significantly outperform two baseline models: 1) a model without pre-training and 2) a model fine-tuned from the official BERT pre-trained on the general corpus, on both offline datasets and the online system. We have open-sourced our datasets for the sake of reproducibility and future work.
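To make the semantic embedding retrieval setup concrete, below is a minimal sketch of how a compact BERT-style encoder can produce query/document embeddings for nearest-neighbor lookup. The checkpoint name, pooling choice, and brute-force similarity search are illustrative assumptions, not the authors' released models or serving code.

```python
# Minimal sketch of semantic embedding retrieval with a compact BERT-style
# encoder. The checkpoint name is a placeholder for a small customized
# pre-trained model (<10% of BERT-base); it is not the paper's released model.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "prajjwal1/bert-tiny"  # placeholder compact encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
encoder.eval()

@torch.no_grad()
def embed(texts):
    """Mean-pool last hidden states into L2-normalized sentence vectors."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)         # masked mean pooling
    return torch.nn.functional.normalize(pooled, dim=-1).numpy()

# Index a toy product catalog and retrieve by cosine similarity
# (a brute-force stand-in for an approximate nearest neighbor index).
catalog = ["wireless bluetooth earbuds",
           "stainless steel water bottle",
           "usb-c fast charging cable"]
doc_vecs = embed(catalog)

query_vec = embed(["cable for charging my phone"])
scores = doc_vecs @ query_vec.T                           # cosine similarity
best = int(np.argmax(scores))
print(catalog[best], float(scores[best]))
```

In production, the brute-force dot product above would be replaced by an approximate nearest neighbor index, and the encoder would be the customized, fine-tuned compact model served on CPU.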