We introduce BitFit, a sparse fine-tuning method in which only the bias terms of the model (or a subset of them) are modified. We show that with small-to-medium training data, applying BitFit to pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model. For larger data, the method is competitive with other sparse fine-tuning methods. Besides their practical utility, these findings are relevant to understanding the commonly used process of fine-tuning: they support the hypothesis that fine-tuning is mainly about exposing knowledge induced by language-modeling training, rather than learning new task-specific linguistic knowledge.
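To make the core idea concrete, the following is a minimal sketch of BitFit-style training in PyTorch with Hugging Face Transformers: all parameters are frozen except the bias terms (plus the randomly initialized task head, which has no pre-trained weights to preserve). The model name, task head, and learning rate are illustrative assumptions, not prescribed by the abstract.

```python
# Minimal sketch of BitFit-style parameter freezing (assumptions: a
# PyTorch / Hugging Face setup, bert-base-uncased, a binary classification head).
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Freeze everything, then re-enable only bias terms and the (untrained)
# classification head.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith(".bias") or name.startswith("classifier")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable} / {total} ({100 * trainable / total:.2f}%)")

# Only the unfrozen parameters are handed to the optimizer; the rest of the
# training loop is a standard fine-tuning loop.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```

As the printed parameter count suggests, the bias terms make up only a small fraction of the model's parameters, which is what makes this form of sparse fine-tuning attractive in practice.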