Motivated by the scarcity of NLP resources for Eastern European languages, we present a broad index of existing Eastern European language resources (90+ datasets and 45+ models), published as a GitHub repository open to updates from the community. Furthermore, to support the evaluation of commonsense reasoning tasks, we provide hand-crafted cross-lingual datasets for five semantic tasks (news categorization, paraphrase detection, natural language inference (NLI), tweet sentiment detection, and news sentiment detection) for some of the Eastern European languages. We run several experiments with existing multilingual models on these datasets to establish performance baselines and compare them with existing results for other languages.