Recent work on tokenizer-free multilingual pretrained models shows promising results in improving cross-lingual transfer and reducing engineering overhead (Clark et al., 2022; Xue et al., 2022). However, these works mainly focus on reporting accuracy on a limited set of tasks and data settings, placing less emphasis on other factors that matter when tuning and deploying the models in practice, such as memory usage, inference speed, and robustness to fine-tuning data. We attempt to fill this gap by performing a comprehensive empirical comparison of multilingual tokenizer-free and subword-based models along these dimensions. Surprisingly, we find that subword-based models may still be the most practical choice in many settings, achieving better performance at lower inference latency and memory usage. Based on these results, we encourage future work on tokenizer-free methods to consider these factors when designing and evaluating new models.