Generative AI has matured to the point where large-scale models can produce text that is nearly indistinguishable from human-written text, as well as remarkably photorealistic images. Automatically measuring how close the distribution of generated data is to the target distribution of real data is a key step in diagnosing existing models and developing better ones. We present MAUVE, a family of comparison measures between pairs of distributions, such as those encountered in the generative modeling of text or images. These scores are statistical summaries of divergence frontiers that capture two types of errors in generative modeling. We explore four approaches to statistically estimating these scores: vector quantization, non-parametric estimation, classifier-based estimation, and parametric Gaussian approximations. We provide statistical bounds for the vector quantization approach. Empirically, we find that the proposed scores, paired with a range of $f$-divergences and statistical estimation methods, can quantify the gaps between the distributions of human-written text and those of modern neural language models, correlating with human judgments and identifying known properties of the generated texts. We conclude by demonstrating MAUVE's applications to other AI domains and discussing practical recommendations.
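To make the vector-quantization approach concrete, below is a minimal sketch in Python of one member of this family: both samples are embedded, jointly quantized with k-means, and the divergence frontier traced over mixtures $R_\lambda = \lambda P + (1-\lambda) Q$ is summarized by the area under the curve of $\big(\exp(-c\,\mathrm{KL}(Q \,\|\, R_\lambda)),\ \exp(-c\,\mathrm{KL}(P \,\|\, R_\lambda))\big)$, using KL as the $f$-divergence. The function name, `num_buckets`, `scaling_c`, and the k-means quantizer are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a MAUVE-style score via vector quantization.
# Assumes pre-computed embeddings for human and model text; the
# hyperparameters below are hypothetical defaults for illustration.
import numpy as np
from sklearn.cluster import KMeans

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, clipped for numerical stability."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def mauve_sketch(human_emb, model_emb, num_buckets=100, scaling_c=5.0, grid=501):
    # 1) Jointly quantize both samples into discrete buckets (vector quantization).
    joint = np.vstack([human_emb, model_emb])
    labels = KMeans(n_clusters=num_buckets, n_init=10).fit_predict(joint)
    h_labels = labels[: len(human_emb)]
    m_labels = labels[len(human_emb):]

    # 2) Estimate discrete distributions P (human) and Q (model) over buckets.
    p = np.bincount(h_labels, minlength=num_buckets) / len(h_labels)
    q = np.bincount(m_labels, minlength=num_buckets) / len(m_labels)

    # 3) Trace the divergence frontier: each mixture R = lam*P + (1-lam)*Q
    #    yields one point capturing the two error types of generative modeling.
    xs, ys = [], []
    for lam in np.linspace(1e-3, 1 - 1e-3, grid):
        r = lam * p + (1 - lam) * q
        xs.append(np.exp(-scaling_c * kl_divergence(q, r)))
        ys.append(np.exp(-scaling_c * kl_divergence(p, r)))

    # 4) Summarize the frontier by the area under the curve (the score).
    order = np.argsort(xs)
    return float(np.trapz(np.array(ys)[order], np.array(xs)[order]))
```

Given `human_emb` and `model_emb` as $(n, d)$ arrays of features from, e.g., a pretrained language model, `mauve_sketch` returns a value in $(0, 1]$, with higher values indicating the two distributions are closer.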