Capturing word meaning in context and distinguishing between correspondences and variations across languages is key to building successful multilingual and cross-lingual text representation models. However, existing multilingual evaluation datasets that evaluate lexical semantics "in-context" have various limitations. In particular, (1) their language coverage is restricted to high-resource languages and skewed in favor of only a few language families and areas; (2) their design makes the task solvable via superficial cues, which results in artificially inflated (and sometimes super-human) performance of pretrained encoders on many target languages and limits their usefulness for model probing and diagnostics; and (3) they offer no support for cross-lingual evaluation. To address these gaps, we present AM2iCo (Adversarial and Multilingual Meaning in Context), a wide-coverage cross-lingual and multilingual evaluation set; it aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts for 14 language pairs. We conduct a series of experiments in a wide range of setups and demonstrate the challenging nature of AM2iCo. The results reveal that current SotA pretrained encoders substantially lag behind human performance, with the largest gaps observed for low-resource languages and languages dissimilar to English.