Masked Part-Of-Speech Model: Does Modeling Long Context Help Unsupervised POS-tagging?

Previous Part-Of-Speech (POS) induction models usually make independence assumptions (e.g., Markov, unidirectional, local dependency) that do not hold in real languages. For example, subject-verb agreement can be both long-term and bidirectional. To facilitate flexible dependency modeling, we propose a Masked Part-of-Speech Model (MPoSM), inspired by the recent success of Masked Language Models (MLM). MPoSM can model arbitrary tag dependencies and performs POS induction through the objective of masked POS reconstruction. We achieve competitive results on both the English Penn WSJ dataset and the universal treebank containing 10 diverse languages. Though modeling long-term dependencies should ideally help this task, our ablation study shows mixed trends across languages. To better understand this phenomenon, we design a novel synthetic experiment that specifically diagnoses the model's ability to learn tag agreement. Surprisingly, we find that even strong baselines fail to solve this problem consistently in a very simplified setting: the agreement between adjacent words. Nonetheless, MPoSM achieves overall better performance. Lastly, we conduct a detailed error analysis to shed light on other remaining challenges. Our code is available at https://github.com/owenzx/MPoSM
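
The idea of masked POS reconstruction can be illustrated with a toy sketch. This is not the authors' architecture: the compatibility matrix `compat`, the function names, and the choice to score the reconstruction against the local tag distribution are all assumptions made for illustration, and the actual MPoSM model and loss may differ. The sketch only shows that the masked position is predicted from every other position, on both sides and at any distance, with no Markov or unidirectional restriction.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reconstruct_masked_tag(tag_dists, mask_pos, compat):
    """Predict the tag distribution at mask_pos from all other positions.

    compat[c][t] is a hypothetical score for how strongly a context tag c
    supports tag t at the masked position (an assumption of this sketch).
    """
    num_tags = len(compat)
    scores = [0.0] * num_tags
    for pos, dist in enumerate(tag_dists):
        if pos == mask_pos:
            continue  # the masked position never sees its own tag
        for c, weight in enumerate(dist):
            for t in range(num_tags):
                scores[t] += weight * compat[c][t]
    return softmax(scores)


def masked_reconstruction_loss(tag_dists, mask_pos, compat):
    """Cross-entropy between the local tag distribution and its reconstruction."""
    predicted = reconstruct_masked_tag(tag_dists, mask_pos, compat)
    target = tag_dists[mask_pos]
    return -sum(p * math.log(q) for p, q in zip(target, predicted) if p > 0)


# Toy sentence of four tokens with two induced tags (0 and 1).
tag_dists = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
no_info = [[0.0, 0.0], [0.0, 0.0]]    # context says nothing about the mask
agreement = [[2.0, 0.0], [0.0, 2.0]]  # context tags favor the same tag

loss_uninformed = masked_reconstruction_loss(tag_dists, 0, no_info)
loss_agreement = masked_reconstruction_loss(tag_dists, 0, agreement)
print(loss_uninformed, loss_agreement)
```

In this toy setting, a compatibility matrix that encodes agreement gives a lower reconstruction loss than an uninformative one, because two of the three context tokens carry the same tag as the masked token.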


Related content

Part-of-speech tagging

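Part of speech (POS) is a basic grammatical attribute of a word, also commonly called its word class. POS tagging is the process of determining the grammatical category of each word in a given sentence and labeling the word with its part of speech; it is an important foundational problem in Chinese information processing. In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking each word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. [1]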
