The [CLS] token in BERT is commonly used as a fixed-length representation for classification tasks, yet prior work has shown that both other tokens and intermediate layers encode valuable contextual information. In this work, we study lightweight extensions to BERT that refine the [CLS] representation by aggregating information across layers and tokens. Specifically, we explore three modifications: (i) max-pooling the [CLS] token across multiple layers, (ii) enabling the [CLS] token to attend over the entire final layer using an additional multi-head attention (MHA) layer, and (iii) combining max-pooling across the full sequence with MHA. Our approach, called MaxPoolBERT, enhances BERT's classification accuracy (especially on low-resource tasks) without requiring new pre-training or significantly increasing model size. Experiments on the GLUE benchmark show that MaxPoolBERT consistently outperforms the standard BERT-base model on low-resource tasks.
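For concreteness, the sketch below shows one way the three aggregation variants could be wired on top of the per-layer hidden states that Hugging Face Transformers exposes via output_hidden_states=True. This is not the authors' implementation: the module name ClsAggregator, the hyperparameters (k_layers, num_heads), and the reading of variant (iii) as max-pooling each token position across the last k layers before applying MHA are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of the three [CLS] aggregation
# variants, operating on BERT's per-layer hidden states.
import torch
import torch.nn as nn


class ClsAggregator(nn.Module):
    """Refines the [CLS] vector by aggregating over layers and tokens."""

    def __init__(self, hidden_size: int = 768, num_heads: int = 8, k_layers: int = 4):
        super().__init__()
        self.k_layers = k_layers
        # One extra multi-head attention layer on top of the encoder output.
        self.mha = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def layer_max_pool(self, hidden_states):
        # (i) element-wise max over the [CLS] vectors of the last k layers.
        # hidden_states: tuple of [batch, seq_len, hidden] tensors, one per layer.
        cls_per_layer = torch.stack(
            [h[:, 0] for h in hidden_states[-self.k_layers:]], dim=1
        )                                                   # [batch, k, hidden]
        return cls_per_layer.max(dim=1).values              # [batch, hidden]

    def cls_attention(self, sequence):
        # (ii) let [CLS] attend over every token of the final layer via MHA.
        query = sequence[:, :1]                             # [CLS] as the sole query
        attended, _ = self.mha(query, sequence, sequence)   # [batch, 1, hidden]
        return attended.squeeze(1)                          # [batch, hidden]

    def pooled_sequence_attention(self, hidden_states):
        # (iii) max-pool each token position across the last k layers, then
        # apply the same [CLS]-query attention to the pooled sequence.
        stacked = torch.stack(list(hidden_states[-self.k_layers:]), dim=1)
        pooled = stacked.max(dim=1).values                  # [batch, seq_len, hidden]
        return self.cls_attention(pooled)


# Hypothetical usage with a Hugging Face BERT encoder:
#   outputs = bert_model(**batch, output_hidden_states=True)
#   agg = ClsAggregator()
#   cls_vec = agg.cls_attention(outputs.hidden_states[-1])  # variant (ii)
```

In this sketch the resulting vector would simply replace the standard [CLS] output as input to the classification head, which is consistent with the claim that the extension adds no new pre-training and only a small number of parameters (the single MHA layer).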