Text normalization is a crucial technology for low-resource languages that lack rigid spelling conventions. Low-resource text normalization has so far relied on hand-crafted rules, which are perceived to be more data-efficient than neural methods. In this paper we examine the case of text normalization for Ligurian, an endangered Romance language. We collect 4,394 Ligurian sentences paired with their normalized versions, as well as the first monolingual corpus for Ligurian. We show that, despite the small amount of data available, a compact transformer-based model can be trained to achieve very low error rates through backtranslation and appropriate tokenization. Our datasets are released to the public.
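To make the role of backtranslation in this setting more concrete, the following is a minimal sketch of how synthetic training pairs might be derived from a monolingual corpus of already-normalized text. It is not the authors' implementation: the abstract does not describe the reverse model, so `toy_reverse_model` is a hypothetical stand-in that applies invented character substitutions (not real Ligurian spelling rules), and the names `backtranslate` and `build_training_set` are likewise assumptions introduced for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

# (non-normalized source, normalized target)
Pair = Tuple[str, str]
ReverseModel = Callable[[str, random.Random], str]


def toy_reverse_model(normalized: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a trained normalized -> non-normalized model.

    The substitutions below are invented for illustration only; a real
    system would use a model trained on the gold pairs in reverse.
    """
    variants = {"ç": "s", "æ": "ae", "ò": "o"}
    out = []
    for ch in normalized:
        # Randomly replace some characters with an alternative spelling.
        if ch in variants and rng.random() < 0.5:
            out.append(variants[ch])
        else:
            out.append(ch)
    return "".join(out)


def backtranslate(
    monolingual: Sequence[str],
    reverse_model: ReverseModel,
    rng: random.Random,
) -> List[Pair]:
    """Pair each clean sentence with a synthetic non-normalized source."""
    return [(reverse_model(sentence, rng), sentence) for sentence in monolingual]


def build_training_set(
    gold_pairs: Sequence[Pair],
    monolingual: Sequence[str],
    reverse_model: ReverseModel,
    seed: int = 0,
) -> List[Pair]:
    """Concatenate gold pairs with synthetic pairs from backtranslation."""
    rng = random.Random(seed)
    return list(gold_pairs) + backtranslate(monolingual, reverse_model, rng)
```

In this sketch the normalized side of every synthetic pair is genuine text, so the forward model always learns to produce clean output; only the input side is synthetic. How the gold and synthetic pairs are weighted or mixed during training is left open here.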