Part-of-speech (POS) tagging is an important component of the NLP pipeline, but many low-resource languages lack the labeled data needed to train a tagger. An established approach in this scenario is to create a labeled training set by transferring annotations from high-resource languages. In this paper, we propose a novel method for transferring labels from multiple high-resource source languages to low-resource target languages. We formalize POS tag projection as graph-based label propagation. Given translations of a sentence in multiple languages, we align words for all language pairs and build a graph with words as nodes and alignment links as edges. We then propagate node labels from source to target using a Graph Neural Network augmented with transformer layers. We show that this propagation creates training sets that allow us to train POS taggers for a diverse set of languages. When combined with enhanced contextualized embeddings, our method achieves a new state of the art for unsupervised POS tagging of low-resource languages.
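The graph formulation can be illustrated with a minimal sketch. It is not the method proposed here: in place of the Graph Neural Network with transformer layers, it uses plain majority voting over aligned neighbors, and the function names, the alignment format, and the toy sentence (including the placeholder target language code `xx`) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_alignment_graph(alignments):
    """Build an undirected graph whose nodes are (language, word_index) pairs.

    `alignments` is a hypothetical input format: a dict mapping a language
    pair (lang_a, lang_b) to a list of aligned word-index pairs (i, j).
    """
    graph = defaultdict(set)
    for (lang_a, lang_b), links in alignments.items():
        for i, j in links:
            u, v = (lang_a, i), (lang_b, j)
            graph[u].add(v)
            graph[v].add(u)
    return graph


def propagate_labels(graph, source_tags, max_iters=10):
    """Spread POS tags from labeled nodes to unlabeled ones by majority vote.

    This is a simple stand-in for a learned propagation model: each unlabeled
    node takes the most frequent tag among its already-labeled neighbors.
    """
    labels = dict(source_tags)
    for _ in range(max_iters):
        updates = {}
        for node, neighbors in graph.items():
            if node in labels:
                continue
            votes = Counter(labels[n] for n in neighbors if n in labels)
            if votes:
                # Sort tags first so that ties are broken deterministically.
                updates[node] = max(sorted(votes), key=votes.get)
        if not updates:
            break
        labels.update(updates)
    return labels


# Toy example: one sentence in English, German, and an unlabeled target "xx".
alignments = {
    ("en", "de"): [(0, 0), (1, 1), (2, 2)],
    ("en", "xx"): [(0, 0), (1, 1)],
    ("de", "xx"): [(1, 1), (2, 2)],
}
source_tags = {
    ("en", 0): "DET", ("en", 1): "NOUN", ("en", 2): "VERB",
    ("de", 0): "DET", ("de", 1): "NOUN", ("de", 2): "VERB",
}

graph = build_alignment_graph(alignments)
labels = propagate_labels(graph, source_tags)
target_tags = [labels[("xx", i)] for i in range(3)]
print(target_tags)
```

In this toy case the third target word has no English alignment and receives its tag only through the German link, which shows why aligning against several source languages can cover gaps left by any single one.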