In recent years, BERT-based models have been extremely successful at a variety of natural language processing (NLP) tasks such as reading comprehension, natural language inference, and sentiment analysis. All BERT-based architectures share the same basic building component: a self-attention block followed by a block of intermediate layers. However, a strong justification for including these intermediate layers remains missing from the literature. In this work, we investigate the contribution of the intermediate layers to overall network performance on downstream tasks. We show that reducing the number of intermediate layers and modifying the BERT-BASE architecture results in minimal loss of fine-tuning accuracy on downstream tasks while decreasing the model's parameter count and training time. Additionally, we use centered kernel alignment and probing linear classifiers to gain insight into our architectural modifications, and we show that removing intermediate layers has little impact on fine-tuned accuracy.
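Centered kernel alignment (CKA) measures how similar two layers' representations are, independent of rotation and isotropic scaling of the feature space. As a rough illustration only (not the paper's implementation), the linear variant of CKA over two representation matrices can be sketched as follows; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices.

    X, Y: arrays of shape (n_examples, n_features); the feature
    dimensions of X and Y may differ. Illustrative sketch only.
    """
    # Center each feature column across examples.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))
    return numerator / denominator
```

In this setting, one would extract per-layer hidden states for a batch of inputs from both the original and the reduced model and compute CKA between corresponding layers; a value near 1 indicates the representations are nearly identical up to rotation and scaling.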