Lexical ambiguity is widespread in language, allowing for the reuse of economical word forms and therefore making language more efficient. If ambiguous words cannot be disambiguated from context, however, this gain in efficiency might make language less clear -- resulting in frequent miscommunication. For a language to be clear and efficiently encoded, we posit that the lexical ambiguity of a word type should correlate with how much information context provides about it, on average. To investigate whether this is the case, we operationalise the lexical ambiguity of a word as the entropy of the meanings it can take, and provide two ways to estimate this entropy -- one which requires human annotation (using WordNet), and one which does not (using BERT), making it readily applicable to a large number of languages. We validate these measures by showing that, on six high-resource languages, there are significant Pearson correlations between our BERT-based estimate of ambiguity and the number of synonyms a word has in WordNet (e.g. $\rho = 0.40$ in English). We then test our main hypothesis -- that a word's lexical ambiguity should negatively correlate with its contextual uncertainty -- and find significant correlations in all 18 typologically diverse languages we analyse. This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
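To make the entropy operationalisation concrete: a word $w$'s lexical ambiguity can be written as $H(M \mid W = w) = -\sum_{m} p(m \mid w) \log p(m \mid w)$, where $m$ ranges over the meanings $w$ can take. The sketch below illustrates the WordNet-based variant of such an estimate via NLTK; note that the count-based sense distribution and the add-one smoothing here are illustrative assumptions, not necessarily the paper's exact estimator.

```python
import math

from nltk.corpus import wordnet as wn  # requires `nltk.download('wordnet')`


def lexical_ambiguity(word: str, pos: str = "n") -> float:
    """Entropy (in nats) of the WordNet sense distribution of `word`.

    The meaning distribution p(m | w) is approximated from WordNet's
    SemCor-derived lemma frequency counts, with add-one smoothing so
    that unattested senses keep non-zero probability. A monosemous
    word receives entropy 0.
    """
    # Collect the lemma entries of `word` across all of its synsets.
    lemmas = [
        lemma
        for synset in wn.synsets(word, pos=pos)
        for lemma in synset.lemmas()
        if lemma.name().lower() == word.lower()
    ]
    if not lemmas:
        return 0.0
    counts = [lemma.count() + 1 for lemma in lemmas]  # add-one smoothing
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts)


if __name__ == "__main__":
    # A polysemous word should score higher than a monosemous one.
    print(f"bank:   {lexical_ambiguity('bank'):.3f}")
    print(f"oxygen: {lexical_ambiguity('oxygen'):.3f}")
```

The annotation-free BERT-based estimate would plug a different meaning distribution -- one inferred from contextual embeddings rather than from a curated sense inventory -- into the same entropy computation.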