In this paper, we present our progress in pre-training monolingual Transformers for Czech and contribute to the research community by releasing our models to the public. The need for such models emerged from our effort to employ Transformers in our language-specific tasks, for which we found the performance of the published multilingual models to be very limited. Since the multilingual models are usually pre-trained on 100+ languages, most low-resource languages (including Czech) are under-represented in these models. At the same time, a huge amount of monolingual training data is available in web archives such as Common Crawl. We have pre-trained and publicly released two monolingual Czech Transformers and compared them with relevant public models trained (at least partially) on Czech. The paper presents the Transformer pre-training procedure as well as a comparison of the pre-trained models on text classification tasks from various domains.