Recent years have seen an increase in the number of large-scale multilingual NLP projects. However, even in such projects, languages with special processing requirements are often excluded. One such language is Japanese. Japanese is written without spaces, so tokenization is non-trivial, and while high-quality open-source tokenizers exist, they can be hard to use and lack English documentation. This paper introduces fugashi, a MeCab wrapper for Python, and gives an introduction to tokenizing Japanese.
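As a minimal sketch of the kind of interface described here, the snippet below tokenizes a Japanese sentence with fugashi. The example sentence is ours, and it assumes a UniDic dictionary (e.g. the unidic-lite package) is installed alongside fugashi.

    from fugashi import Tagger

    # With no arguments, Tagger uses the installed dictionary
    # (assumed here to be unidic-lite).
    tagger = Tagger()

    text = "麩菓子は、麩を主材料とした日本の菓子。"

    # Calling the tagger on a string yields one node per token;
    # each node carries the surface form and dictionary features.
    for word in tagger(text):
        print(word.surface, word.feature.lemma, word.pos, sep="\t")

Because Japanese has no spaces, the tokens above (麩菓子, は, 、, ...) come from the dictionary-backed segmentation that MeCab performs, rather than from any whitespace splitting.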