In this study, we empirically compare two recently proposed decoding methods, namely Contrastive Search (CS) and Contrastive Decoding (CD), for open-ended text generation. The automatic evaluation results suggest that, while CS performs worse than CD on the MAUVE metric, it substantially surpasses CD on the diversity and coherence metrics. More notably, extensive human evaluations across three different domains demonstrate that human annotators universally prefer CS over CD by substantial margins. The contradictory results between MAUVE and human evaluations reveal that MAUVE does not accurately reflect human preferences. Therefore, we call upon the research community to develop better evaluation metrics for open-ended text generation. To ensure the reproducibility of our work, we have open-sourced all our code, evaluation results, as well as human annotations at https://github.com/yxuansu/Contrastive_Search_versus_Contrastive_Decoding.
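For readers unfamiliar with the two methods, the sketch below illustrates the token-selection rule each applies at a single decoding step, following the published formulations: CS balances model confidence against a degeneration penalty (maximum cosine similarity between the candidate's representation and the context), while CD selects the token where an expert LM most out-scores a smaller amateur LM, restricted to the expert's plausible set. This is a minimal illustration written by us, not the released code; the function names, hyperparameter defaults, and toy input shapes are our own assumptions.

```python
import numpy as np

def contrastive_search_step(probs, cand_hidden, prev_hidden, k=5, alpha=0.6):
    """One step of contrastive search (sketch).

    probs:       (V,) next-token distribution from the LM
    cand_hidden: (V, d) hidden state the model would produce for each candidate
    prev_hidden: (t, d) hidden states of the context tokens generated so far
    Score(v) = (1 - alpha) * p(v | x_<t) - alpha * max_j cos(h_v, h_{x_j})
    """
    top_k = np.argsort(probs)[-k:]  # candidate set: top-k most probable tokens

    def max_cos(a, B):  # max cosine similarity between vector a and rows of B
        sims = (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-9)
        return sims.max()

    scores = [(1 - alpha) * probs[v] - alpha * max_cos(cand_hidden[v], prev_hidden)
              for v in top_k]
    return int(top_k[int(np.argmax(scores))])

def contrastive_decoding_step(log_p_expert, log_p_amateur, plaus_alpha=0.1):
    """One step of contrastive decoding (sketch).

    Pick argmax of log p_expert(v) - log p_amateur(v), restricted to the
    expert's plausible set {v : p_expert(v) >= plaus_alpha * max_v p_expert(v)}.
    """
    p_exp = np.exp(log_p_expert)
    mask = p_exp >= plaus_alpha * p_exp.max()  # plausibility constraint
    diff = np.where(mask, log_p_expert - log_p_amateur, -np.inf)
    return int(np.argmax(diff))
```

In practice both rules are applied autoregressively: the chosen token is appended to the context and the step repeats, with CS requiring one model plus hidden-state access and CD requiring forward passes through two models of different sizes.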