Subtitles appear on screen as short pieces of text, segmented based on formal constraints (length) and syntactic/semantic criteria. Subtitle segmentation can be evaluated with sequence segmentation metrics against a human reference. However, standard segmentation metrics cannot be applied when systems generate outputs whose text differs from the reference, e.g. with end-to-end subtitling systems. In this paper, we study ways to conduct reference-based evaluations of segmentation accuracy irrespective of the textual content. We first conduct a systematic analysis of existing metrics for evaluating subtitle segmentation. We then introduce $\Sigma$, a new Subtitle Segmentation Score derived from an approximate upper bound of BLEU on segmentation boundaries, which allows us to disentangle the effect of good segmentation from text quality. To compare $\Sigma$ with existing metrics, we further propose a boundary projection method from imperfect hypotheses to the true reference. Results show that all metrics are able to reward high-quality output, but for similar outputs the system ranking depends on each metric's sensitivity to error type. Our thorough analyses suggest $\Sigma$ is a promising candidate for segmentation evaluation, but its reliability over other segmentation metrics remains to be validated through correlations with human judgements.
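To make the intuition behind $\Sigma$ concrete, here is a minimal sketch, assuming subtitle boundaries are marked inline with `<eob>` (end of block) and `<eol>` (end of line) tokens, a common convention in subtitling corpora. It normalizes BLEU computed over boundary-augmented sequences by BLEU over the bare text, so that text-quality errors roughly cancel; this ratio-based normalization is a simplification for illustration, not the paper's exact upper-bound derivation, and the `sigma_like_score` helper is hypothetical.

```python
# A minimal, illustrative sketch of the idea behind Sigma (not the
# paper's exact formulation): reward correct boundary placement
# independently of text quality. Boundaries are assumed to be marked
# inline with <eob> (end of block) and <eol> (end of line) tokens.
import sacrebleu

BREAKS = ("<eob>", "<eol>")

def strip_breaks(sent: str) -> str:
    """Remove boundary tokens, leaving only the bare text."""
    return " ".join(tok for tok in sent.split() if tok not in BREAKS)

def sigma_like_score(hyps: list[str], refs: list[str]) -> float:
    # BLEU over sequences that still contain the boundary tokens:
    # sensitive to both text errors and misplaced boundaries.
    bleu_br = sacrebleu.corpus_bleu(hyps, [refs]).score
    # BLEU over the bare text: a rough proxy for the best score the
    # boundary-augmented BLEU could reach given the text quality.
    bleu_nb = sacrebleu.corpus_bleu(
        [strip_breaks(h) for h in hyps],
        [[strip_breaks(r) for r in refs]],
    ).score
    # Normalizing by the proxy lets text-quality errors roughly cancel,
    # leaving a score dominated by segmentation accuracy.
    return bleu_br / bleu_nb if bleu_nb > 0 else 0.0

# Same words, one boundary of the wrong type: penalized only for the break.
hyps = ["we have a dream <eol> of better subtitles <eob>"]
refs = ["we have a dream <eob> of better subtitles <eob>"]
print(f"Sigma-like score: {sigma_like_score(hyps, refs):.3f}")
```

With identical words and one mismatched break token, the bare-text BLEU is perfect while the boundary-augmented BLEU is penalized, so the ratio isolates the segmentation error.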