With recent advances in open-domain story generation, the lack of reliable automatic evaluation metrics has become an increasingly pressing issue that hinders progress in the field. Prior research suggests that learnable evaluation metrics promise more accurate assessments because they correlate more highly with human judgments. A critical bottleneck in obtaining a reliable learnable metric is the lack of high-quality training data from which classifiers can learn to distinguish plausible from implausible machine-generated stories. Previous work relied on \textit{heuristically manipulated} plausible examples to mimic possible system drawbacks such as repetition, contradiction, or irrelevant content at the text level, which can be \textit{unnatural} and can \textit{oversimplify} the characteristics of implausible machine-generated stories. We propose to tackle these issues by generating a more comprehensive set of implausible stories using {\em plots}, which are structured representations of the controllable factors used to generate stories. Because these plots are compact and structured, it is easier to manipulate them to generate text with targeted undesirable properties while maintaining the grammatical correctness and naturalness of the generated sentences. To improve the quality of the generated implausible stories, we further apply the adversarial filtering procedure presented by \citet{zellers2018swag} to select a more nuanced set of implausible texts. Experiments show that evaluation metrics trained on our generated data yield more reliable automatic assessments that correlate remarkably better with human judgments than the baselines.