More than 2 billion mobile users worldwide type in multiple languages on soft keyboards. On a monolingual keyboard, 38% of falsely auto-corrected words are valid in another language. Such errors can be avoided by detecting the language of each typed word and then validating the word against that language. Language detection is a well-known problem in natural language processing. In this paper, we present a fast, lightweight, and accurate Language Detection Engine (LDE) for multilingual typing that dynamically adapts to the user's intended language in real time. We propose a novel approach that fuses a character N-gram model with a logistic-regression-based selector model to identify the language. Additionally, we present a parameter reduction technique that significantly reduces inference time. We also discuss various optimizations applied across LDE to resolve ambiguity in input text among languages that share the same character patterns. Our method achieves an average accuracy of 94.5% for Indian languages in Latin script and 98% for European languages on code-switched data. The model outperforms fastText by 60.39% and ML-Kit by 23.67% in F1 score for European languages. LDE is also fast on a mobile device, with an average inference time of 25.91 microseconds.
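To make the idea of fusing a character N-gram model with a logistic regression selector more concrete, the following is a minimal sketch, not the LDE implementation. It assumes only two languages, add-one smoothed trigram models, and a single selector feature (the difference between the two per-language scores). The toy word lists and all names (`CharNgramModel`, `train_selector`, `detect`) are hypothetical and chosen for illustration; the abstract does not specify the actual features, training data, or parameter reduction scheme.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with boundary markers."""
    padded = "^" * (n - 1) + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Add-one smoothed character n-gram model for a single language."""

    def __init__(self, words, n=3):
        self.n = n
        self.counts = Counter(g for w in words for g in char_ngrams(w, n))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen n-grams

    def score(self, word):
        """Average log-probability per n-gram, so word length does not dominate."""
        grams = char_ngrams(word, self.n)
        denom = self.total + self.vocab
        return sum(math.log((self.counts[g] + 1) / denom) for g in grams) / len(grams)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_selector(xs, ys, lr=0.5, steps=200):
    """Fit a one-feature logistic regression (weight, bias) by batch gradient descent."""
    weight = bias = 0.0
    for _ in range(steps):
        errs = [sigmoid(weight * x + bias) - y for x, y in zip(xs, ys)]
        weight -= lr * sum(e * x for e, x in zip(errs, xs)) / len(xs)
        bias -= lr * sum(errs) / len(xs)
    return weight, bias


# Toy word lists (assumption: illustrative only, not the paper's training data).
EN = ["the", "this", "that", "with", "where", "which", "there", "think"]
ES = ["que", "para", "como", "pero", "porque", "quiero", "donde", "cuando"]

en_model, es_model = CharNgramModel(EN), CharNgramModel(ES)


def feature(word):
    """Selector input: how much better the English model explains the word."""
    return en_model.score(word) - es_model.score(word)


weight, bias = train_selector(
    [feature(x) for x in EN + ES],
    [1] * len(EN) + [0] * len(ES),  # label 1 = English, 0 = Spanish
)


def detect(word):
    """Return the language chosen by the logistic regression selector."""
    return "en" if sigmoid(weight * feature(word) + bias) >= 0.5 else "es"
```

In this sketch, `detect("then")` and `detect("quien")` are classified word by word, which mirrors the per-word validation use case described above. A production system would need far larger training corpora, more than two languages, and richer selector features such as the language of preceding words.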