We investigate how Multilingual BERT (mBERT) encodes grammar by examining how the high-order grammatical feature of morphosyntactic alignment (how different languages define what counts as a "subject") is manifested across the embedding spaces of different languages. To understand if and how morphosyntactic alignment affects contextual embedding spaces, we train classifiers to recover the subjecthood of mBERT embeddings in transitive sentences (which do not contain overt information about morphosyntactic alignment) and then evaluate them zero-shot on intransitive sentences (where subjecthood classification depends on alignment), within and across languages. We find that the resulting classifier distributions reflect the morphosyntactic alignment of their training languages. Our results demonstrate that mBERT representations are influenced by high-level grammatical features that are not manifested in any one input sentence, and that this is robust across languages. Further examining the characteristics that our classifiers rely on, we find that features such as passive voice, animacy and case strongly correlate with classification decisions, suggesting that mBERT does not encode subjecthood purely syntactically, but that subjecthood embedding is continuous and dependent on semantic and discourse factors, as is proposed in much of the functional linguistics literature. Together, these results provide insight into how grammatical features manifest in contextual embedding spaces, at a level of abstraction not covered by previous work.
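The probing setup described above — train a subjecthood classifier on embeddings of transitive arguments, then apply it zero-shot to unseen arguments and read off graded probabilities — can be sketched as follows. The vectors here are synthetic stand-ins (the actual study uses mBERT contextual embeddings of verb arguments), and the logistic-regression probe trained by SGD is an illustrative choice, not the paper's exact classifier.

```python
import math
import random

random.seed(0)
DIM = 8

def make_vec(center):
    # Synthetic stand-in for a contextual embedding of a verb argument.
    return [random.gauss(center, 1.0) for _ in range(DIM)]

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

# "Transitive" training data: A-arguments (subjects, label 1)
# vs. O-arguments (objects, label 0).
train = [(make_vec(1.0), 1) for _ in range(200)] + \
        [(make_vec(-1.0), 0) for _ in range(200)]

# Minimal logistic-regression probe trained with plain SGD.
w = [0.0] * DIM
b = 0.0
lr = 0.1
for _ in range(50):
    for x, y in train:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        err = p - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        b -= lr * err

def prob_subject(x):
    # Graded subjecthood score rather than a hard label, matching the
    # continuous view of subjecthood discussed in the abstract.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Zero-shot step: score unseen vectors standing in for intransitive
# subjects (S), which the probe never saw during training.
s_scores = [prob_subject(make_vec(0.3)) for _ in range(50)]
print(round(sum(s_scores) / len(s_scores), 2))
```

In the cross-lingual version of this experiment, the probe trained on one language's embeddings would simply be applied to another language's argument embeddings in the same way.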