Despite much progress in recent years, the vast majority of work in natural language processing (NLP) is on standard languages with many speakers. In this work, we instead focus on low-resource languages, in particular non-standardized low-resource languages. Even within branches of major language families that are often considered well-researched, little is known about the extent and type of available resources or about the major NLP challenges for these language varieties. The first step to address this situation is a systematic survey of available corpora (most importantly annotated corpora, which are particularly valuable for NLP research). Focusing on Germanic low-resource language varieties, we provide such a survey in this paper. Except for geolocation (origin of speaker or document), we find that manually annotated linguistic resources are sparse and, where they exist, mostly cover morphosyntax. Despite this lack of resources, we observe that interest in this area is increasing: there is active development and a growing research community. To facilitate research, we make our overview of over 80 corpora publicly available. A companion website for this overview is available at https://github.com/mainlp/germanic-lrl-corpora