With the democratization of e-commerce platforms, an increasingly diverse user base is opting to shop online. To provide a comfortable and reliable shopping experience, it is important to enable users to interact with the platform in the language of their choice. Accurate query translation is essential for Cross-Lingual Information Retrieval (CLIR) with vernacular queries. Owing to their internet-scale operations, e-commerce platforms receive millions of search queries every day; however, creating a parallel training set to train an in-domain translation model is cumbersome. This paper proposes an unsupervised domain adaptation approach to translate search queries without using any parallel corpus. We use an open-domain translation model (trained on a public corpus) and adapt it to the query data using only monolingual queries from the two languages. In addition, fine-tuning with a small labeled set further improves the results. For demonstration, we show results for Hindi-to-English query translation and use the mBART-large-50 model as the baseline to improve upon. Experimental results show that, without using any parallel corpus, we obtain an improvement of more than 20 BLEU points over the baseline, while fine-tuning with a small 50k labeled set provides more than 27 BLEU points of improvement over the baseline.
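As context for the baseline, the sketch below shows how an open-domain mBART-50 model can be used for Hindi-to-English translation via the Hugging Face transformers API. It is a minimal illustration, assuming the public `facebook/mbart-large-50-many-to-many-mmt` checkpoint as a stand-in for the paper's open-domain model; the sample query is hypothetical and not from the paper's data.

```python
# Minimal sketch of the open-domain baseline (not the paper's adapted model).
# Assumptions: the public many-to-many mBART-50 checkpoint and an illustrative query.
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

tokenizer.src_lang = "hi_IN"        # Hindi source query
query = "सूती साड़ी"                  # illustrative e-commerce query ("cotton saree")
inputs = tokenizer(query, return_tensors="pt")

# Force English as the target language during generation.
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],
    max_length=32,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```

The paper's contribution is to adapt such a model to the query domain using only monolingual queries, and optionally a small labeled set; that adaptation procedure is not reproduced here.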