This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The operations include various levels of script normalization, including visual invariance-preserving operations that subsume and go beyond the standard Unicode normalization forms, as well as transformations that modify the visual appearance of characters in accordance with the regional orthographies for eleven contemporary languages from diverse language families. The library also provides simple FST-based romanization and transliteration. We additionally attempt to formalize the typology of Perso-Arabic characters by providing one-to-many mappings from Unicode code points to the languages that use them. While our work focuses on the Arabic script diaspora rather than Arabic itself, this approach could be adopted for any language that uses the Arabic script, thus providing a unified framework for treating a script family used by close to a billion people.
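To make the distinction between standard Unicode normalization and visual invariance-preserving normalization concrete, the following is a minimal sketch in plain Python. It is not the library's API and does not use FSTs: the function name `normalize` and the rewrite table `VISUAL_REWRITES` are hypothetical, and the single rewrite rule is an illustrative assumption rather than a rule taken from the library. The sketch first applies NFC and then rewrites a sequence that NFC leaves untouched but that may render identically to a single code point.

```python
import unicodedata

# Hypothetical, illustrative rewrite table (an assumption, not taken from the
# library): sequences that NFC does not compose but that may look the same as
# a single code point when rendered.
VISUAL_REWRITES = {
    # ARABIC LETTER HEH + HAMZA ABOVE -> ARABIC LETTER HEH WITH YEH ABOVE
    "\u0647\u0654": "\u06C0",
}


def normalize(text: str) -> str:
    """Apply NFC, then the extra visual-invariance rewrites above."""
    text = unicodedata.normalize("NFC", text)
    for source, target in VISUAL_REWRITES.items():
        text = text.replace(source, target)
    return text


# NFC alone composes ALEF + MADDAH ABOVE into ALEF WITH MADDA ABOVE.
assert normalize("\u0627\u0653") == "\u0622"

# NFC leaves HEH + HAMZA ABOVE as two code points; the extra rule merges them.
assert unicodedata.normalize("NFC", "\u0647\u0654") == "\u0647\u0654"
assert normalize("\u0647\u0654") == "\u06C0"
```

A dictionary of string replacements stands in here for what the paper describes as FST components; a transducer-based implementation would compile such rules into a single composed machine and could make them context-sensitive, which simple substring replacement cannot.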