Part-of-speech (POS) is a basic grammatical attribute of a word, and parts of speech are also commonly called word classes. Part-of-speech tagging is the process of determining the grammatical category of each word in a given sentence, deciding its part of speech and labeling it accordingly; it is an important foundational problem in Chinese information processing. In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging, is the process of marking up the words in a text (corpus) as corresponding to a particular part of speech,[1] based on both their definition and their context.
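
The following minimal sketch shows what POS tagging looks like in practice. It assumes the NLTK library is installed and its tokenizer and tagger models have been downloaded; the resource names and the exact output tags may vary across NLTK versions.

    # Minimal POS tagging sketch using NLTK (assumes nltk is installed).
    import nltk

    # One-time downloads of the tokenizer and tagger data
    # (names may differ slightly across NLTK versions).
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    sentence = "Part-of-speech tagging assigns a grammatical category to each word."
    tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
    tags = nltk.pos_tag(tokens)             # assign a Penn Treebank tag to each token

    print(tags)
    # e.g. [('tagging', 'NN'), ('assigns', 'VBZ'), ('a', 'DT'), ...]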


Title: Should All Cross-Lingual Embeddings Speak English?

Abstract:

Most recent work on cross-lingual word embeddings is English-centric. The vast majority of lexicon induction evaluation dictionaries are between English and another language, and the English embedding space is selected by default as the hub when learning in a multilingual setting. In this work, we challenge these practices. First, we show that the choice of hub language has a significant impact on downstream lexicon induction and zero-shot POS tagging performance. Second, we both expand a standard English-centered evaluation dictionary collection to include all language pairs via triangulation, and create new dictionaries for under-represented languages. Evaluating existing methods over all these language pairs sheds light on whether they are suitable for aligning embeddings from distant languages and brings new challenges to the field. Finally, in our analysis we identify general guidelines for strong cross-lingual embedding baselines that extend to language pairs which do not include English.
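
The triangulation step mentioned in the abstract can be illustrated with a small sketch: given an English–X dictionary and an English–Y dictionary, X–Y word pairs are induced by pivoting through shared English entries. The function and dictionary format below are hypothetical illustrations of the idea, not the authors' released code.

    # Illustrative sketch of dictionary triangulation through English.
    # Dictionaries are modeled as lists of (english_word, foreign_word) pairs;
    # the data format used by the paper may differ.
    from collections import defaultdict

    def triangulate(en_to_x, en_to_y):
        """Induce an X-Y dictionary by pivoting through shared English entries."""
        x_by_en = defaultdict(set)
        for en, x in en_to_x:
            x_by_en[en].add(x)

        pairs = set()
        for en, y in en_to_y:
            for x in x_by_en.get(en, ()):
                pairs.add((x, y))
        return sorted(pairs)

    # Toy example: English-Spanish and English-Italian entries
    # yield Spanish-Italian pairs.
    en_es = [("dog", "perro"), ("house", "casa")]
    en_it = [("dog", "cane"), ("house", "casa")]
    print(triangulate(en_es, en_it))
    # [('casa', 'casa'), ('perro', 'cane')]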

Latest Papers

Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have been mostly demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models. Confirming these results on a much larger data set of noisy French user-generated content, we argue that such character-based language models can be an asset for NLP in low-resource and high language variability settings.
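
As an illustration of the character-based tagging setup described above, the sketch below encodes each word from its characters with a small LSTM and predicts one POS tag per word. It is a hypothetical minimal model written in PyTorch, not the architecture, pre-training procedure, or hyperparameters used in the paper.

    # Minimal sketch of a character-based POS tagger in PyTorch (illustrative only;
    # the paper's character-level language model is much larger and pre-trained).
    import torch
    import torch.nn as nn

    class CharWordEncoder(nn.Module):
        """Encode a word as the final hidden state of a character-level LSTM."""
        def __init__(self, n_chars, char_dim=32, word_dim=64):
            super().__init__()
            self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
            self.char_lstm = nn.LSTM(char_dim, word_dim, batch_first=True)

        def forward(self, char_ids):          # char_ids: (n_words, max_word_len)
            emb = self.char_emb(char_ids)
            _, (h, _) = self.char_lstm(emb)   # h: (1, n_words, word_dim)
            return h.squeeze(0)               # (n_words, word_dim)

    class CharTagger(nn.Module):
        """Character-based word encoder followed by a sentence-level BiLSTM tagger."""
        def __init__(self, n_chars, n_tags, word_dim=64, hidden=128):
            super().__init__()
            self.encoder = CharWordEncoder(n_chars, word_dim=word_dim)
            self.sent_lstm = nn.LSTM(word_dim, hidden, batch_first=True,
                                     bidirectional=True)
            self.out = nn.Linear(2 * hidden, n_tags)

        def forward(self, char_ids):          # char_ids: (n_words, max_word_len)
            words = self.encoder(char_ids).unsqueeze(0)  # (1, n_words, word_dim)
            ctx, _ = self.sent_lstm(words)
            return self.out(ctx).squeeze(0)   # (n_words, n_tags) tag scores

    # Toy usage: a 4-word sentence, words padded to 6 characters, 20 possible tags.
    model = CharTagger(n_chars=100, n_tags=20)
    char_ids = torch.randint(1, 100, (4, 6))
    print(model(char_ids).shape)   # torch.Size([4, 20])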
