We explore deep clustering of text representations for unsupervised model interpretation and induction of syntax. As these representations are high-dimensional, out-of-the-box methods like KMeans do not work well. Thus, our approach jointly transforms the representations into a lower-dimensional cluster-friendly space and clusters them. We consider two notions of syntax in this work: part-of-speech induction (POSI) and constituency labelling (CoLab). Interestingly, we find that Multilingual BERT (mBERT) contains a surprising amount of syntactic knowledge of English; possibly even as much as English BERT (EBERT). Our model can be used as a supervision-free probe, which is arguably a less-biased way of probing. We find that unsupervised probes benefit from higher layers more than supervised probes do. We further note that our unsupervised probe utilizes EBERT and mBERT representations differently, especially for POSI. We validate the efficacy of our probe by demonstrating its capabilities as an unsupervised syntax induction technique. Our probe works well for both syntactic formalisms by simply adapting the input representations. We report competitive performance of our probe on 45-tag English POSI, state-of-the-art performance on 12-tag POSI across 10 languages, and competitive results on CoLab. We also perform zero-shot syntax induction on resource-impoverished languages and report strong results.
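To make the idea of jointly learning a lower-dimensional, cluster-friendly space concrete, below is a minimal, generic sketch of a DEC-style deep clustering objective (an encoder plus learnable centroids trained with a self-sharpening KL loss). It is an illustration under assumed dimensions and hyperparameters, not the authors' exact model.

```python
# A generic DEC-style sketch: jointly project high-dimensional token
# representations into a low-dimensional space and cluster them there.
# NOTE: this is an illustrative assumption, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepCluster(nn.Module):
    def __init__(self, in_dim=768, latent_dim=32, n_clusters=45):
        super().__init__()
        # Non-linear projection of contextual representations (e.g. BERT
        # vectors) into a lower-dimensional latent space.
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Learnable cluster centroids living in the latent space.
        self.centroids = nn.Parameter(torch.randn(n_clusters, latent_dim))

    def forward(self, x):
        z = self.encoder(x)
        # Student-t kernel gives soft assignments q_ij of point i to cluster j.
        dist = torch.cdist(z, self.centroids) ** 2
        q = (1.0 + dist).pow(-1.0)
        return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    # Sharpened targets p_ij emphasise confident assignments.
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

model = DeepCluster()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(128, 768)  # stand-in for token representations
for _ in range(100):
    q = model(x)
    loss = F.kl_div(q.log(), target_distribution(q).detach(),
                    reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
```

After training, each point's cluster is simply the argmax of its soft assignment; in the POSI setting those clusters would be mapped to part-of-speech tags for evaluation.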