Improvements in machine-learning-based NLP performance are often presented alongside bigger models and more complex code. This presents a trade-off: better scores come at the cost of larger tools, since bigger models tend to require more resources during both training and inference. We present multiple methods for measuring the size of a model and for comparing it with the model's performance. In a case study on part-of-speech tagging, we then apply these techniques to taggers for eight languages and present a novel analysis identifying which taggers are size-performance optimal. Results indicate that some classical taggers place on the size-performance skyline across languages. Further, although deep models achieve the highest performance on multiple scores, it is often not the most complex of these that reaches peak performance.
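To make the notion of size-performance optimality concrete, the following minimal Python sketch marks a tagger as lying on the skyline (Pareto frontier) when no other tagger is both at least as small and at least as accurate, with one of the two strictly better; the tagger names, sizes, and scores below are invented for illustration and are not results from the paper.

```python
from typing import List, Tuple

def skyline(points: List[Tuple[str, float, float]]) -> List[str]:
    """Return names of taggers on the size-performance skyline.

    Each point is (name, size, score); smaller size and higher score
    are both better. A tagger is on the skyline if no other tagger
    dominates it (is <= in size and >= in score, strictly better in one).
    """
    on_skyline = []
    for name, size, score in points:
        dominated = any(
            (s <= size and a >= score) and (s < size or a > score)
            for n, s, a in points
            if n != name
        )
        if not dominated:
            on_skyline.append(name)
    return on_skyline

# Hypothetical taggers: (name, model size in MB, tagging accuracy)
taggers = [
    ("hmm",         2.0,   0.942),
    ("perceptron",  15.0,  0.958),
    ("bilstm",      90.0,  0.965),
    ("bert",        420.0, 0.972),
    ("bert-large",  1300.0, 0.971),  # bigger but not better: dominated
]

print(skyline(taggers))  # ['hmm', 'perceptron', 'bilstm', 'bert']
```

In this toy example the largest model is dominated by a smaller one with a higher score, illustrating how a classical, compact tagger can sit on the skyline while the most complex model does not.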