This paper reviews and summarizes human evaluation practices described in 97 style transfer papers with respect to three main evaluation aspects: style transfer, meaning preservation, and fluency. In principle, evaluations by human raters should be the most reliable. However, in style transfer papers, we find that protocols for human evaluations are often underspecified and not standardized, which hampers the reproducibility of research in this field and progress toward better human and automatic evaluation methods.