This paper studies the relative importance of attention heads in Transformer-based models to aid their interpretability in cross-lingual and multi-lingual tasks. Prior research has found that only a few attention heads are important in each mono-lingual Natural Language Processing (NLP) task, and that pruning the remaining heads leads to comparable or improved model performance. However, the impact of pruning attention heads is not yet clear in cross-lingual and multi-lingual tasks. Through extensive experiments, we show that (1) pruning a number of attention heads in a multi-lingual Transformer-based model generally has positive effects on its performance in cross-lingual and multi-lingual tasks and (2) the attention heads to be pruned can be ranked using gradients and identified with a few trial experiments. Our experiments focus on sequence labeling tasks, with potential applicability to other cross-lingual and multi-lingual tasks. For comprehensiveness, we examine two pre-trained multi-lingual models, namely multi-lingual BERT (mBERT) and XLM-R, on three tasks across 9 languages each. We also discuss the validity of our findings and their extensibility to truly resource-scarce languages and other task settings.
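For illustration, the gradient-based ranking and pruning described above can be sketched as follows. This is a minimal sketch, not the authors' released code: it assumes the HuggingFace transformers head_mask/prune_heads interface, an illustrative mBERT token-classification setup, and a toy batch standing in for a labeled development set.

```python
# Minimal sketch of gradient-based attention-head ranking and pruning.
# Model name, label count, the toy batch, and k are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "bert-base-multilingual-cased"   # mBERT; XLM-R would be analogous
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=9)
model.eval()

# One "trial" batch; in practice this would be a labeled dev set for the task.
batch = tokenizer(["EU rejects German call"], return_tensors="pt")
labels = torch.zeros_like(batch["input_ids"])  # placeholder labels

n_layers = model.config.num_hidden_layers
n_heads = model.config.num_attention_heads

# Differentiable mask over all heads (1 = keep); its gradients act as
# sensitivity scores for the heads.
head_mask = torch.ones(n_layers, n_heads, requires_grad=True)

outputs = model(**batch, labels=labels, head_mask=head_mask)
outputs.loss.backward()

# Rank heads by absolute gradient (larger = more important).
importance = head_mask.grad.abs()
ranking = importance.flatten().argsort()   # least important first

# Prune the k least important heads, then re-evaluate on the dev set.
k = 12
to_prune = {}
for idx in ranking[:k].tolist():
    layer, head = divmod(idx, n_heads)
    to_prune.setdefault(layer, []).append(head)
model.prune_heads(to_prune)
```

In this sketch, the "few trial experiments" mentioned above correspond to re-evaluating the pruned model for a small set of candidate values of k and keeping the best-performing one.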