NLP research in Hebrew has largely focused on morphology and syntax, where rich annotated datasets in the spirit of Universal Dependencies are available. Semantic datasets, however, are in short supply, which hinders advances in Hebrew NLP technology. In this work, we present ParaShoot, the first question answering dataset in Modern Hebrew. The dataset follows the format and crowdsourcing methodology of SQuAD and contains approximately 3,000 annotated examples, a size comparable to that of other question answering datasets in low-resource languages. We provide the first baseline results using recently released BERT-style models for Hebrew, showing that there is significant room for improvement on this task.
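To make the SQuAD format mentioned above concrete, the following is a minimal sketch of a single extractive question answering record and a consistency check on its answer span. The field names follow the flattened SQuAD-style layout (`context`, `question`, `answers` with `text` and `answer_start`); whether ParaShoot uses exactly these names is an assumption, and the Hebrew sentence is invented for illustration rather than taken from the dataset.

```python
# Hypothetical SQuAD-style record. The text is made up for illustration
# ("The cat sat on the mat.") and does not come from ParaShoot.
context = "החתול ישב על השטיח."
answer_text = "השטיח"

example = {
    "id": "demo-0",  # placeholder identifier
    "context": context,
    "question": "על מה ישב החתול?",  # "What did the cat sit on?"
    "answers": {
        "text": [answer_text],
        # Character offset of the answer inside the context.
        "answer_start": [context.find(answer_text)],
    },
}


def span_is_consistent(ex):
    """Return True if every answer text appears in the context at its stated offset."""
    ctx = ex["context"]
    texts = ex["answers"]["text"]
    starts = ex["answers"]["answer_start"]
    return all(
        start >= 0 and ctx[start:start + len(text)] == text
        for text, start in zip(texts, starts)
    )


print(span_is_consistent(example))
```

A check of this kind is useful for extractive datasets because each answer is defined as a character span of the paragraph, so a mismatched offset makes an example unusable for training or evaluation.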