Language identification is the task of determining the language in which a document is written. Applications such as automatic spell checker selection must identify the language of very short strings, such as text message fragments. In this work, we reproduce a language identification architecture that Apple briefly sketched in a blog post. We confirm the performance of the bi-LSTM model and find that it outperforms current open-source language identifiers. We further find that its mistakes stem from confusion between related languages.