Since the popularization of the Transformer as a general-purpose feature encoder for NLP, many studies have attempted to decode linguistic structure from its novel multi-head attention mechanism. However, much of this work has focused almost exclusively on English -- a language with rigid word order and little inflectional morphology. In this study, we present decoding experiments for multilingual BERT across 18 languages in order to test the generalizability of the claim that dependency syntax is reflected in attention patterns. We show that full dependency trees can be decoded from single attention heads with above-baseline accuracy, and that individual relations are often tracked by the same heads across languages. Furthermore, in an attempt to address recent debates about the status of attention as an explanatory mechanism, we experiment with fine-tuning mBERT on a supervised parsing objective while freezing different subsets of parameters. Interestingly, when the objective steers the model toward explicit linguistic structure, we find much of the same structure represented in the resulting attention patterns, with notable differences depending on which parameters are frozen.
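To make the decoding setup concrete, the sketch below shows one way to extract a dependency tree from a single mBERT attention head: attention weights are treated as arc scores and a maximum spanning arborescence is computed over them. The choice of layer and head, the arc direction convention, and the (ignored) subword-to-word alignment are illustrative assumptions for this sketch, not the exact procedure used in our experiments.

```python
# Minimal sketch: decode a dependency tree from one mBERT attention head
# via a maximum spanning arborescence. Layer/head indices are illustrative,
# and subword-to-word alignment is skipped for brevity.
import networkx as nx
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_attentions=True)
model.eval()

sentence = "The cat sat on the mat"
enc = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

layer, head = 7, 5                      # illustrative choice of head
attn = out.attentions[layer][0, head]   # (seq_len, seq_len) attention matrix

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
# Drop [CLS]/[SEP]; read attention weight attn[i, j] as the score of a j -> i arc.
keep = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]

G = nx.DiGraph()
for i in keep:
    for j in keep:
        if i != j:
            G.add_edge(j, i, weight=attn[i, j].item())

tree = nx.maximum_spanning_arborescence(G)
for head_idx, dep_idx in sorted(tree.edges(), key=lambda e: e[1]):
    print(f"{tokens[dep_idx]:>8s}  <-  {tokens[head_idx]}")
```

The decoded arcs can then be scored against gold Universal Dependencies trees (e.g., unlabeled attachment score) to compare individual heads against a positional baseline.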
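The parameter-freezing experiments can likewise be sketched as a selective requires_grad mask over mBERT's parameter groups before fine-tuning on the parsing objective. The particular grouping below (training only the self-attention query/key projections) is an illustrative assumption rather than the exact configurations used in our experiments.

```python
# Minimal sketch of selectively freezing mBERT parameters before fine-tuning.
# The chosen parameter group is illustrative, not the paper's exact setup.
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-multilingual-cased")

# Example: freeze everything except the self-attention query/key projections,
# so only the parameters that shape attention distributions are updated.
for name, param in model.named_parameters():
    param.requires_grad = any(k in name for k in ("attention.self.query",
                                                  "attention.self.key"))

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```

Varying which groups stay trainable (embeddings, attention projections, feed-forward layers) is what lets us ask how much of the induced structure survives in the attention patterns under each freezing regime.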