With the broad reach of the internet and smartphones, e-commerce platforms have an increasingly diverse user base. Because many users are not conversant in English, they prefer to browse in their regional language or in a mix of their regional language and English. In a recent study of our query data, we observed that many of the queries we receive are code-mixed, specifically Hinglish, i.e., queries with one or more Hindi words written in English (Latin) script. We propose a transformer-based approach to code-mixed query translation that enables users to search with these queries. We demonstrate the effectiveness of pre-trained encoder-decoder models, trained on a large corpus of unlabeled English text, for this task. Using generic-domain translation models, we created a pseudo-labelled dataset for training the model on search queries and verified the effectiveness of various data augmentation techniques. Further, to reduce the latency of the model, we use knowledge distillation and weight quantization. The effectiveness of the proposed method has been validated through experimental evaluations and A/B testing. The model is currently live on the Flipkart app and website, serving millions of queries.