Transfer learning with a unified Transformer framework (T5) that converts all language problems into a text-to-text format has recently been proposed as a simple, yet effective, transfer learning approach. Although a multilingual version of the T5 model (mT5) has been introduced, it is not clear how well it fares on non-English tasks involving diverse data. To investigate this question, we apply mT5 to Arabic, a language with a wide variety of dialects. For evaluation, we use an existing benchmark for Arabic language understanding and introduce ARGEN, a new benchmark for Arabic language generation. We also pre-train three powerful Arabic-specific text-to-text Transformer-based models and evaluate them on the two benchmarks. Our new models perform significantly better than mT5 and exceed MARBERT, the current state-of-the-art Arabic BERT-based model, on Arabic language understanding. They also set a new SOTA on the generation benchmark. Our new models are publicly released at https://github.com/UBC-NLP/araT5, and ARGEN will be released through the same repository.