Pretrained Transformer-based models fine-tuned on domain-specific corpora have changed the landscape of NLP. However, training or fine-tuning these models for individual tasks can be time-consuming and resource-intensive. Thus, much current research focuses on using Transformers for multi-task learning (Raffel et al., 2020) and on how to group tasks so that a multi-task model learns effective representations that can be shared across tasks (Standley et al., 2020; Fifty et al., 2021). In this work, we show that a single multi-task model can match the performance of task-specific models when the task-specific models exhibit similar representations across all of their hidden layers and their gradients are aligned, i.e., their gradients follow the same direction. We hypothesize that these observations explain the effectiveness of multi-task learning. We validate our observations on internal radiologist-annotated datasets covering the cervical and lumbar spine. Our method is simple and intuitive, and can be applied to a wide range of NLP problems.
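Representation similarity and gradient alignment are both measurable quantities. The following is a minimal PyTorch sketch of one way to quantify them; the helper names (`gradient_alignment`, `linear_cka`) and the choice of cosine similarity and linear CKA as metrics are illustrative assumptions, not the paper's stated procedure.

```python
import torch
import torch.nn.functional as F

def flat_task_gradient(model, loss):
    # Gradient of one task's loss w.r.t. the model's trainable parameters,
    # flattened into a single vector.
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
    return torch.cat([g.reshape(-1) for g in grads if g is not None])

def gradient_alignment(model, loss_a, loss_b):
    # Cosine similarity between the two tasks' gradient vectors; a positive
    # value indicates the gradients point in roughly the same direction.
    g_a = flat_task_gradient(model, loss_a)
    g_b = flat_task_gradient(model, loss_b)
    return F.cosine_similarity(g_a, g_b, dim=0).item()

def linear_cka(X, Y):
    # Linear CKA between two [n_examples, hidden_dim] representation matrices,
    # e.g. hidden states from the same layer of two task-specific models.
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = (Y.T @ X).norm() ** 2
    return (cross / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()
```

Under these assumed metrics, high CKA at every hidden layer together with positive gradient cosine similarity would correspond to the conditions described above for a single multi-task model to match the task-specific models.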