We present the winning entry to the Multilingual Lexical Normalization (MultiLexNorm) shared task at W-NUT 2021 (van der Goot et al., 2021a), which evaluates lexical-normalization systems on 12 social media datasets in 11 languages. We base our solution on a pre-trained byte-level language model, ByT5 (Xue et al., 2021a), which we further pre-train on synthetic data and then fine-tune on authentic normalization data. Our system achieves the best performance by a wide margin in intrinsic evaluation, and also the best performance in extrinsic evaluation through dependency parsing. The source code is released at https://github.com/ufal/multilexnorm2021 and the fine-tuned models at https://huggingface.co/ufal.
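To make the pipeline concrete, below is a minimal, illustrative sketch of running a ByT5 checkpoint as a sequence-to-sequence normalizer with the HuggingFace transformers library. The checkpoint name `ufal/byt5-small-multilexnorm2021-en` is assumed from the released models linked above, and the plain-sentence input is a simplification; the system's actual pre-processing and decoding are documented in the linked repository.

```python
# Illustrative sketch: lexical normalization with a ByT5 checkpoint via
# HuggingFace transformers. The model name and plain-sentence input are
# assumptions; see https://github.com/ufal/multilexnorm2021 for the
# system's actual pre-processing and decoding.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed checkpoint name from the released models at https://huggingface.co/ufal.
MODEL = "ufal/byt5-small-multilexnorm2021-en"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

# ByT5 tokenizes raw UTF-8 bytes, so noisy social-media spellings never
# fall out of vocabulary.
noisy = "new pix comming tomoroe"
inputs = tokenizer(noisy, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

A byte-level model is a natural fit here: normalization mostly edits a few characters within a word, and operating on raw bytes sidesteps the subword-vocabulary mismatch between noisy and canonical spellings.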