An ongoing challenge in natural language processing is that its major advances disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Because the resources required to train and evaluate models are lacking, most modern language technologies are either nonexistent or unreliable for endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered-language documents into machine-readable data. However, such OCR output is typically noisy, and most word alignment models are not built to work under noisy conditions. In this work, we study existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, reduces the alignment error rate of a state-of-the-art neural alignment model by up to 59.6%.
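To make the two technical notions in the abstract concrete, the following is a minimal sketch, not the method used in this work. It assumes the standard definition of alignment error rate over sure links S and possible links P (with S a subset of P), AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), and a generic character-level noise model (random substitution, deletion, and insertion) as a stand-in for OCR errors. The function names `alignment_error_rate` and `add_char_noise`, the `rate` parameter, and the Latin `alphabet` default are hypothetical and chosen for illustration only.

```python
import random


def alignment_error_rate(predicted, sure, possible=None):
    """Standard AER: 1 - (|A & S| + |A & P|) / (|A| + |S|), with S a subset of P.

    Each alignment is an iterable of (source_index, target_index) pairs.
    """
    a = set(predicted)
    s = set(sure)
    p = s | set(possible or ())  # enforce S ⊆ P
    if not a and not s:
        return 0.0  # nothing to align and nothing predicted
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


def add_char_noise(sentence, rate, rng, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Assumed OCR-like noise: each character is corrupted with probability `rate`.

    Tokens are kept one-to-one with the input (a token that would become
    empty is left unchanged), so gold alignment indices remain valid.
    """
    noisy_tokens = []
    for token in sentence.split():
        chars = []
        for ch in token:
            if rng.random() >= rate:
                chars.append(ch)  # keep the character untouched
                continue
            op = rng.choice(("substitute", "delete", "insert"))
            if op == "substitute":
                chars.append(rng.choice(alphabet))
            elif op == "insert":
                chars.append(ch)
                chars.append(rng.choice(alphabet))
            # "delete": append nothing
        noisy_tokens.append("".join(chars) or token)
    return " ".join(noisy_tokens)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility
    print(add_char_noise("the quick brown fox", rate=0.2, rng=rng))

    predicted = {(0, 0), (1, 1), (2, 3)}
    sure = {(0, 0), (1, 1)}
    possible = {(2, 2)}
    print(alignment_error_rate(predicted, sure, possible))
```

In a robustness study along these lines, one would corrupt one side of a parallel corpus at several noise rates, run the aligner on the corrupted text, and compare the AER against the clean baseline. Real OCR errors are not uniformly random (visually similar characters are confused more often), so a uniform model like this one is only a rough approximation.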