Foundational Hebrew NLP tasks such as segmentation, tagging, and parsing have to date relied on various versions of the Hebrew Treebank (HTB, Sima'an et al. 2001). However, the data in HTB, a single-source newswire corpus, is now over 30 years old and does not cover many aspects of contemporary Hebrew on the web. This paper presents a new, freely available UD treebank of Hebrew, stratified across a range of topics selected from Hebrew Wikipedia. In addition to introducing the corpus and evaluating the quality of its annotations, we deploy automatic validation tools based on grew (Guillaume, 2021) and conduct the first cross-domain parsing experiments in Hebrew. We obtain new state-of-the-art (SOTA) results on UD NLP tasks, using a combination of the latest language modelling and some incremental improvements to existing transformer-based approaches. We also release a new version of the UD HTB that matches the annotation scheme updates from our new corpus.