Multilingual generative language models (LMs) are increasingly fluent in a large variety of languages. Trained on the concatenation of corpora in multiple languages, they enable powerful transfer from high-resource languages to low-resource ones. However, it is still unknown what cultural biases are induced in the predictions of these models. In this work, we focus on one language property highly influenced by culture: formality. We analyze the formality distributions of the predictions of XGLM and BLOOM, two popular generative multilingual language models, in 5 languages. We classify 1,200 generations per language as formal, informal, or incohesive and measure the impact of the prompt formality on the predictions. Overall, we observe a diversity of behaviors across the models and languages. For instance, XGLM generates informal text in Arabic and Bengali when conditioned on informal prompts, much more often than BLOOM. In addition, even though both models are highly biased toward the formal style when prompted neutrally, we find that they generate a significant number of informal predictions even when prompted with formal text. We release 6,000 annotated samples with this work, paving the way for future work on the formality of generative multilingual LMs.