Language models have demonstrated the ability to generate highly fluent text; however, it remains unclear whether their output retains coherent high-level structure (e.g., story progression). Here, we propose applying a statistical tool, model criticism in latent space, to evaluate the high-level structure of generated text. Model criticism compares the distributions of real and generated data in a latent space obtained according to a posited generative process. Different generative processes identify specific failure modes of the underlying model. We perform experiments on three representative aspects of high-level discourse -- coherence, coreference, and topicality -- and find that transformer-based language models are able to capture topical structures but have a harder time maintaining structural coherence or modeling coreference.