Pre-trained language models (e.g., BART) have shown impressive results when fine-tuned on large summarization datasets. However, little is understood about this fine-tuning process, including what knowledge is retained from pre-training or how content selection and generation strategies are learnt across iterations. In this work, we analyze the training dynamics for generation models, focusing on news summarization. Across different datasets (CNN/DM, XSum, MediaSum) and summary properties, such as abstractiveness and hallucination, we study what the model learns at different stages of its fine-tuning process. We find that properties such as copy behavior are learnt earlier in the training process, and these observations are robust across domains. On the other hand, factual errors, such as hallucination of unsupported facts, are learnt in the later stages, and this behavior varies more across domains. Based on these observations, we explore complementary approaches for modifying training: first, disregarding high-loss tokens that are challenging to learn, and second, disregarding low-loss tokens that are learnt very quickly. This simple training modification allows us to configure the model to achieve different goals, such as improving factuality or improving abstractiveness.
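To make the training modification concrete, the sketch below shows one plausible way to disregard high-loss (or low-loss) tokens when computing the fine-tuning objective. This is a minimal illustration assuming a standard PyTorch seq2seq setup; the function name `truncated_token_loss`, the `drop_frac` fraction, and the padding convention (`-100` labels) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def truncated_token_loss(logits, labels, drop_frac=0.1, drop="high"):
    """Token-level cross-entropy that ignores a fraction of tokens by loss.

    drop="high": skip the highest-loss tokens (hard to learn).
    drop="low":  skip the lowest-loss tokens (learnt very quickly).
    Note: this is an illustrative sketch, not the authors' code.
    """
    # Per-token losses, flattened over the batch; -100 marks padding.
    per_tok = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
        reduction="none",
    )
    valid = labels.view(-1) != -100
    losses = per_tok[valid]

    k = int(drop_frac * losses.numel())
    if k == 0:
        return losses.mean()

    # Rank tokens by loss (ascending) and average only the kept tokens.
    sorted_losses, _ = torch.sort(losses)
    kept = sorted_losses[:-k] if drop == "high" else sorted_losses[k:]
    return kept.mean()
```

Under this reading, `drop="high"` would steer the model away from hard-to-fit target tokens (e.g., those requiring unsupported facts), while `drop="low"` would down-weight tokens that are already fit early in training, such as copied spans.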