Subword-level models have been the dominant paradigm in NLP. Character-level models, however, have the benefit of processing each character individually, giving the model access to more fine-grained information that could ultimately lead to better models. Recent work has shown character-level models to be competitive with subword models, but costly in terms of time and computation. Character-level models with a downsampling component alleviate this cost, but at the expense of quality, particularly for machine translation. This work analyzes the problems of previous downsampling methods and introduces a novel downsampling method that is informed by subwords. This new method not only outperforms existing downsampling methods, showing that characters can be downsampled without sacrificing quality, but also achieves promising performance relative to subword models for translation.