Evaluation metrics that are not robust to dialect variation make it impossible to tell how well systems perform for many groups of users, and can even penalize systems for producing text in lower-resource dialects. However, there is currently no way to quantify how metrics respond to changes in the dialect of a generated utterance. We thus formalize dialect robustness and dialect awareness as goals for NLG evaluation metrics. We introduce a suite of methods and corresponding statistical tests that can be used to assess metrics in light of these two goals. Applying the suite to current state-of-the-art metrics, we demonstrate that they are not dialect-robust and that semantic perturbations frequently lead to smaller decreases in a metric than the introduction of dialect features. As a first step to overcome this limitation, we propose a training schema, NANO, which introduces regional and language information into the pretraining process of a metric. We demonstrate that NANO provides a size-efficient way for models to improve dialect robustness while simultaneously improving their performance on the standard metric benchmark.
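The abstract refers to a suite of methods and statistical tests for assessing dialect robustness. The snippet below is a minimal, illustrative sketch of the kind of paired comparison such a test could involve, not the paper's exact protocol: the placeholder `metric` function, the example triples, and the choice of a Wilcoxon signed-rank test are all assumptions made for illustration.

```python
# Illustrative sketch of a dialect-robustness check: for the same reference,
# a dialect rewrite of a hypothesis (meaning preserved) should not score lower
# than a semantic perturbation (meaning changed). All names and data below are
# assumptions for the sake of a runnable example.

from scipy.stats import wilcoxon  # paired, one-sided significance test


def metric(hypothesis: str, reference: str) -> float:
    """Placeholder for any evaluation metric; here a toy lexical-overlap score."""
    hyp, ref = set(hypothesis.lower().split()), set(reference.lower().split())
    return len(hyp & ref) / max(len(ref), 1)


# Each item pairs a reference with (i) a dialect rewrite that preserves meaning
# and (ii) a perturbation that changes meaning.
examples = [
    {
        "reference": "I am going to the store to buy groceries.",
        "dialect_rewrite": "I'm finna go to the store to buy groceries.",
        "semantic_perturbation": "I am going to the store to return groceries.",
    },
    {
        "reference": "She has been working here for ten years.",
        "dialect_rewrite": "She been working here ten years.",
        "semantic_perturbation": "She has been working here for two years.",
    },
]

dialect_scores = [metric(x["dialect_rewrite"], x["reference"]) for x in examples]
perturbed_scores = [metric(x["semantic_perturbation"], x["reference"]) for x in examples]

# A dialect-robust metric should score dialect rewrites at least as high as
# meaning-changing perturbations; a paired one-sided test checks this across items.
stat, p_value = wilcoxon(dialect_scores, perturbed_scores, alternative="greater")
print(f"Wilcoxon statistic={stat:.3f}, p={p_value:.3f}")
```

With the toy overlap metric, the semantic perturbations often score higher than the dialect rewrites, which mirrors the failure mode the abstract describes; a dialect-robust metric would reverse that ordering.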