We present a new pre-trained language model (PLM) for Modern Hebrew, termed AlephBERTGimmel, which employs a much larger vocabulary (128K items) than any standard Hebrew PLM before it. We perform a contrastive analysis of this model against all previous Hebrew PLMs (mBERT, heBERT, AlephBERT) and assess the effects of larger vocabularies on task performance. Our experiments show that larger vocabularies lead to fewer splits, and that reducing splits improves model performance across different tasks. All in all, this new model achieves new SOTA on all available Hebrew benchmarks, including Morphological Segmentation, POS Tagging, Full Morphological Analysis, NER, and Sentiment Analysis. We subsequently advocate for PLMs that are larger not only in terms of the number of layers or the amount of training data, but also in terms of their vocabulary. We release the new model publicly for unrestricted use.
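As a minimal sketch (not from the paper) of how the claim about fewer splits can be checked, the snippet below compares subword fertility (average pieces per word) on a sample Hebrew sentence across tokenizers, assuming the HuggingFace transformers library; the AlephBERTGimmel model ID is a placeholder assumption, while the mBERT and AlephBERT IDs are their public checkpoints.

```python
# Sketch: compare how many subword pieces each tokenizer needs per word.
from transformers import AutoTokenizer

models = {
    "mBERT": "bert-base-multilingual-cased",
    "AlephBERT": "onlplab/alephbert-base",
    # Placeholder ID -- substitute the officially released AlephBERTGimmel checkpoint.
    "AlephBERTGimmel": "alephbertgimmel-base",
}

sentence = "עיתונאים רבים סיקרו את האירוע בירושלים"  # sample Hebrew sentence
words = sentence.split()

for name, model_id in models.items():
    tok = AutoTokenizer.from_pretrained(model_id)
    # Tokenize each whitespace word separately and count the resulting pieces.
    pieces_per_word = [tok.tokenize(w) for w in words]
    total_pieces = sum(len(p) for p in pieces_per_word)
    print(f"{name}: {total_pieces} pieces for {len(words)} words "
          f"(fertility {total_pieces / len(words):.2f})")
```

With a larger vocabulary, more Hebrew word forms are kept whole, so the fertility value is expected to be lower for the 128K-item vocabulary than for the smaller ones.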