Minimal sentence pairs are frequently used to analyze the behavior of language models. It is often assumed that model behavior on contrastive pairs is predictive of model behavior at large. We argue that two conditions are necessary for this assumption to hold: First, a tested hypothesis should be well-motivated, since experiments show that contrastive evaluation can lead to false positives. Second, test data should be chosen so as to minimize distributional discrepancy between evaluation time and deployment time. For a good approximation of deployment-time decoding, we recommend that minimal pairs be created based on machine-generated text, as opposed to human-written references. We present a contrastive evaluation suite for English–German MT that implements this recommendation.
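The contrastive evaluation described above can be sketched as follows: a model scores both members of a minimal pair, and the pair "passes" if the correct variant receives the higher score. This is a minimal illustration using a hypothetical toy bigram scorer standing in for a real MT model's log-probabilities; the vocabulary, probabilities, and function names are invented for the example.

```python
import math

# Hypothetical bigram log-probabilities standing in for a real model's
# scores; in practice these would come from an actual MT/LM scorer.
BIGRAM_LOGPROB = {
    ("<s>", "the"): math.log(0.9),
    ("the", "cat"): math.log(0.6),
    ("cat", "sits"): math.log(0.5),
    ("cat", "sit"): math.log(0.1),
}

def score(sentence):
    """Sum of bigram log-probabilities; unseen bigrams get a small floor."""
    tokens = ["<s>"] + sentence.split()
    floor = math.log(1e-4)
    return sum(BIGRAM_LOGPROB.get(pair, floor)
               for pair in zip(tokens, tokens[1:]))

def contrastive_eval(correct, contrast):
    """A minimal pair passes if the model scores the correct variant higher."""
    return score(correct) > score(contrast)

print(contrastive_eval("the cat sits", "the cat sit"))  # True
```

Note that a pass here only shows a preference between two fixed strings; as argued above, this need not reflect what the model would actually generate at deployment time, which is why the choice of test distribution matters.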