Serbian is a Slavic language spoken by over 12 million people and well understood by more than 15 million. In natural language processing, it can be considered a low-resource language. Serbian is also highly inflected. The combination of rich inflectional morphology and scarce language resources makes natural language processing of Serbian challenging. Nevertheless, over the past three decades there have been a number of initiatives to develop resources and methods for natural language processing of Serbian, ranging from corpora of free text drawn from books and the internet, through annotated corpora for classification and named entity recognition, to various methods and models that perform these tasks. In this paper, we review these initiatives, resources, and methods, as well as their availability.