Although adapting pre-trained language models with only a few examples has shown promising performance on text classification, little is understood about where the performance gain comes from. In this work, we answer this question by interpreting the adaptation behavior using post-hoc explanations of model predictions. By modeling feature statistics of these explanations, we discover that (1) without fine-tuning, pre-trained models (e.g., BERT and RoBERTa) show strong prediction bias across labels; and (2) although few-shot fine-tuning mitigates this prediction bias and achieves promising prediction performance, our analysis shows that models gain the improvement by capturing non-task-related features (e.g., stop words) or shallow data patterns (e.g., lexical overlaps). These observations warn that pursuing model performance with fewer examples may incur pathological prediction behavior, which calls for further sanity checks on model predictions and careful evaluation design in few-shot fine-tuning.
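To make the first observation concrete, the following is a minimal sketch (not the paper's code) of how one might probe label prediction bias in an un-fine-tuned masked language model via a prompt-based classifier. The checkpoint, prompt template, label words ("terrible"/"great"), and example texts are illustrative assumptions; the paper's actual explanation method and datasets are not reproduced here.

```python
# Sketch: measure how skewed an un-fine-tuned BERT's label predictions are
# when classifying via a cloze prompt. Assumed setup, not the paper's exact one.
from collections import Counter

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-uncased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

label_words = ["terrible", "great"]  # hypothetical label words for labels 0 and 1
label_ids = tokenizer.convert_tokens_to_ids(label_words)

texts = [
    "The movie was a delight from start to finish.",
    "I found the plot tedious and the acting flat.",
    "An average film, neither good nor bad.",
]

preds = []
with torch.no_grad():
    for text in texts:
        prompt = f"{text} It was {tokenizer.mask_token}."
        enc = tokenizer(prompt, return_tensors="pt")
        mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
        # Compare the masked-LM logits of the two label words at the [MASK] position.
        logits = model(**enc).logits[0, mask_pos, label_ids]
        preds.append(int(logits.argmax()))

# A heavily skewed label distribution on a balanced sample is the kind of
# prediction bias across labels described above.
counts = Counter(preds)
for label, count in sorted(counts.items()):
    print(f"label {label} ({label_words[label]}): {count / len(preds):.2f} of predictions")
```

On a larger, label-balanced sample, comparing this predicted-label distribution against the true label distribution gives a simple quantitative signal of the bias; the same loop can be extended with a post-hoc attribution method to collect feature statistics such as how often stop words rank among the top-attributed tokens.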