Various evaluation metrics exist for natural language generation tasks, but they have limited utility for story generation: they generally correlate poorly with human judgments and are not designed to evaluate fine-grained story aspects such as fluency and relatedness. In this paper, we propose DeltaScore, a perturbation-based approach for evaluating fine-grained story aspects. Our core idea rests on the hypothesis that the better a story performs on a specific aspect (e.g., fluency), the more it will be affected by a corresponding perturbation (e.g., the introduction of typos). To measure this impact, we compute the likelihood difference between the pre- and post-perturbation story using large pre-trained language models. We evaluate DeltaScore against state-of-the-art model-based and traditional similarity-based metrics across two story domains, and examine its correlation with human judgments on five fine-grained story aspects: fluency, coherence, relatedness, logicality, and interestingness. We find that DeltaScore performs strongly in evaluating fine-grained story aspects, and, surprisingly, that a single perturbation method can effectively capture most of them.
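To make the core computation concrete, below is a minimal sketch of the likelihood-difference idea, assuming a Hugging Face causal language model (GPT-2 as a small stand-in for the large pre-trained models referenced above). The `add_typos` perturbation and the mean-token-log-likelihood scoring are illustrative assumptions, not the paper's exact implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small causal LM used to score stories; a stand-in for the
# large pre-trained language models used in the paper.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def log_likelihood(text: str) -> float:
    """Mean token log-likelihood of `text` under the LM."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean
        # cross-entropy over tokens; negating it gives the
        # average per-token log-likelihood.
        out = model(**inputs, labels=inputs["input_ids"])
    return -out.loss.item()

def delta_score(story: str, perturb) -> float:
    """Likelihood drop caused by a perturbation: a larger drop
    suggests the story scored higher on the targeted aspect."""
    return log_likelihood(story) - log_likelihood(perturb(story))

# Hypothetical fluency-targeted perturbation: inject simple typos
# by swapping two adjacent characters in every third word.
def add_typos(text: str) -> str:
    words = text.split()
    for i in range(0, len(words), 3):
        w = words[i]
        if len(w) > 3:
            words[i] = w[0] + w[2] + w[1] + w[3:]
    return " ".join(words)

story = "Once upon a time, a young fox learned to outwit the hounds."
print(delta_score(story, add_typos))  # larger values = more fluent
```

In this sketch, a fluent story loses more likelihood when typos are injected than a disfluent one does, so the difference serves as a fluency score; swapping in other perturbations (e.g., shuffling sentences) would target other aspects such as coherence.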