Large transformer models, such as BERT, achieve state-of-the-art results in machine reading comprehension (MRC) for open-domain question answering (QA). However, transformers have a high computational cost for inference, which makes them hard to apply in online QA systems for applications such as voice assistants. To reduce computational cost and latency, we propose decoupling the transformer MRC model into an input component and a cross component. The decoupling allows part of the representation computation to be performed offline and cached for online use. To retain the accuracy of the decoupled transformer, we devise a knowledge distillation objective from a standard transformer model. Moreover, we introduce learned representation compression layers, which reduce the storage requirement of the cache by a factor of four. In experiments on the SQuAD 2.0 dataset, the decoupled transformer reduces the computational cost and latency of open-domain MRC by 30-40% with only a 1.2-point drop in F1-score compared to a standard transformer.
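As a rough illustration of this decoupling, the sketch below uses plain PyTorch encoder layers: the lower layers (the input component) encode the question and the passage independently, so passage representations can be computed offline, compressed, and cached, while the upper layers (the cross component) attend over the concatenated sequence online. The layer split, module names, and compression size are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a decoupled transformer encoder, assuming a BERT-base-like
# configuration (768-dim, 12 heads). All hyperparameters are illustrative.
import torch
import torch.nn as nn

class DecoupledEncoder(nn.Module):
    def __init__(self, d_model=768, nhead=12, num_input_layers=9,
                 num_cross_layers=3, compressed_dim=192):
        super().__init__()
        # Lower "input component": question and passage are encoded
        # independently, so passage representations can be precomputed offline.
        self.input_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
             for _ in range(num_input_layers)])
        # Learned compression/decompression keeps the cached passage
        # representations small (roughly 4x smaller in this sketch).
        self.compress = nn.Linear(d_model, compressed_dim)
        self.decompress = nn.Linear(compressed_dim, d_model)
        # Upper "cross component": full attention over question + passage.
        self.cross_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
             for _ in range(num_cross_layers)])

    def encode_input(self, x):
        # Run the input component; for passages this can be done offline and
        # the compressed output stored in the cache.
        for layer in self.input_layers:
            x = layer(x)
        return self.compress(x)

    def forward(self, question_emb, cached_passage):
        q = self.encode_input(question_emb)  # computed online for the query
        joint = torch.cat([self.decompress(q),
                           self.decompress(cached_passage)], dim=1)
        for layer in self.cross_layers:
            joint = layer(joint)
        return joint  # fed to a downstream QA span-prediction head
```

At serving time, only the input component for the (short) question and the cross component run online, which is where the reported 30-40% reduction in computational cost and latency would come from.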