With many real-world applications of Natural Language Processing (NLP) involving long texts, there has been a rise in NLP benchmarks that measure the accuracy of models that can handle longer input sequences. However, these benchmarks do not consider the trade-offs between accuracy, speed, and power consumption as input sizes or model sizes are varied. In this work, we perform a systematic study of this accuracy vs. efficiency trade-off on two widely used long-sequence models, Longformer-Encoder-Decoder (LED) and Big Bird, during fine-tuning and inference on four datasets from the SCROLLS benchmark. To study how this trade-off differs across hyperparameter settings, we compare the models across four sequence lengths (1024, 2048, 3072, 4096) and two model sizes (base and large) under a fixed resource budget. We find that LED consistently achieves better accuracy at lower energy costs than Big Bird. For summarization, we find that increasing model size is more energy efficient than increasing sequence length for higher accuracy. However, this comes at the cost of a large drop in inference speed. For question answering, we find that smaller models are both more efficient and more accurate due to the larger training batch sizes possible under a fixed resource budget.
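To make the experimental setup concrete, below is a minimal sketch (not the authors' released code) of the kind of accuracy-vs-efficiency sweep the abstract describes: running LED and Big Bird at the four sequence lengths while logging latency and estimated energy. The Hugging Face checkpoint names, the `run_inference` helper, and the use of the `codecarbon` package for energy estimation are illustrative assumptions, not details taken from the paper.

```python
# Sketch: sweep long-sequence models over input lengths, logging
# inference latency and estimated energy. Checkpoint names and the
# helper below are assumptions for illustration only.
import time
from codecarbon import EmissionsTracker
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CHECKPOINTS = {
    "led-base": "allenai/led-base-16384",
    "led-large": "allenai/led-large-16384",
    "bigbird-pegasus-large": "google/bigbird-pegasus-large-arxiv",
}
SEQ_LENS = [1024, 2048, 3072, 4096]  # sequence lengths compared in the paper


def run_inference(checkpoint: str, seq_len: int, document: str) -> dict:
    """Summarize one document at a fixed input length; record time/energy."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    inputs = tokenizer(
        document, max_length=seq_len, truncation=True, return_tensors="pt"
    )
    tracker = EmissionsTracker(log_level="error")
    tracker.start()
    start = time.perf_counter()
    summary_ids = model.generate(**inputs, max_new_tokens=256)
    latency = time.perf_counter() - start
    # stop() returns estimated kg CO2eq; energy (kWh) is written to
    # codecarbon's emissions.csv log alongside it.
    emissions_kg = tracker.stop()
    summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]
    return {"latency_s": latency, "emissions_kg": emissions_kg,
            "summary": summary}


if __name__ == "__main__":
    doc = "..."  # a long input document, e.g. from SCROLLS
    for name, ckpt in CHECKPOINTS.items():
        for seq_len in SEQ_LENS:
            print(name, seq_len, run_inference(ckpt, seq_len, doc)["latency_s"])
```

Accuracy on the downstream task (e.g., ROUGE for summarization) would be computed separately and plotted against these latency and energy figures to trace out the trade-off curve.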