Fusion-in-Decoder (FiD) is a powerful retrieval-augmented language model that sets the state-of-the-art on many knowledge-intensive NLP tasks. However, FiD suffers from very expensive inference. We show that the majority of inference time results from memory bandwidth constraints in the decoder, and propose two simple changes to the FiD architecture to speed up inference by 7x. The faster decoder inference then allows for a much larger decoder. We denote FiD with the above modifications as FiDO, and show that it strongly improves performance over existing FiD models for a wide range of inference budgets. For example, FiDO-Large-XXL performs faster inference than FiD-Base and achieves better performance than FiD-Large.
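Since the abstract only describes FiD at a high level, the following is a minimal, hypothetical sketch of the Fusion-in-Decoder data flow it refers to: each (question + retrieved passage) input is encoded independently, the encoder outputs are concatenated, and the decoder cross-attends over the fused sequence. The `encode` / `decode_step` functions, dimensions, and data below are toy stand-ins for illustration, not the paper's implementation.

```python
# Toy sketch of the FiD data flow (not the authors' code): encode passages
# independently, fuse by concatenation, decode over the fused memory.
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 16          # hidden size (toy value)
PASSAGE_LEN = 8       # tokens per (question + passage) input
N_PASSAGES = 4        # number of retrieved passages

def encode(tokens: np.ndarray) -> np.ndarray:
    """Stand-in encoder: maps token ids to hidden states of shape (len, d_model)."""
    embed = rng.standard_normal((100, D_MODEL))   # toy embedding table
    return embed[tokens]

def decode_step(fused_memory: np.ndarray, prev_state: np.ndarray) -> np.ndarray:
    """Stand-in decoder step: cross-attends over the fused encoder memory."""
    scores = fused_memory @ prev_state            # (total_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ fused_memory                 # attention-weighted context

# Encode each (question + passage_i) input independently -- the parallelizable
# encoder-side work in FiD.
passage_inputs = [rng.integers(0, 100, PASSAGE_LEN) for _ in range(N_PASSAGES)]
encoded = [encode(p) for p in passage_inputs]

# Fuse by concatenating along the sequence axis; the decoder attends over all
# passages jointly. This long fused memory is what makes decoder inference
# memory-bandwidth bound, the bottleneck the abstract says FiDO targets.
fused = np.concatenate(encoded, axis=0)           # (N_PASSAGES * PASSAGE_LEN, D_MODEL)

state = rng.standard_normal(D_MODEL)
for _ in range(3):                                # a few toy decoding steps
    state = decode_step(fused, state)
print("fused memory shape:", fused.shape, "decoder state shape:", state.shape)
```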