Incorporating stronger syntactic biases into neural language models (LMs) is a long-standing goal, but research in this area often focuses on modeling English text, where constituent treebanks are readily available. Extending constituent tree-based LMs to the multilingual setting, where dependency treebanks are more common, is possible via dependency-to-constituency conversion methods. However, this raises the question of which tree formats are best for learning the model, and for which languages. We investigate this question by training recurrent neural network grammars (RNNGs) with various conversion methods and evaluating them empirically in a multilingual setting. We examine the effect on LM performance across nine conversion methods and five languages using seven types of syntactic tests. On average, the performance of our best model represents a 19% increase in accuracy over the worst choice across all languages. Our best model also outperforms sequential and overparameterized LMs, suggesting a positive effect of syntax injection in a multilingual setting. Our experiments highlight the importance of choosing the right tree formalism and provide insights for making an informed decision.