Toward a Generalization Metric for Deep Generative Models

From arXiv; 1st I Can't Believe It's Not Better Workshop (ICBINB@NeurIPS 2020). Source code is available at https://github.com/htt210/GeneralizationMetricGAN

Measuring the generalization capacity of Deep Generative Models (DGMs) is difficult because of the curse of dimensionality. Evaluation metrics for DGMs such as Inception Score, Fréchet Inception Distance, Precision-Recall, and Neural Net Divergence (NND) try to estimate the distance between the generated distribution and the target distribution using a polynomial number of samples. Researchers routinely optimize for these metrics when designing new models. Despite the claims made for them, it is still unclear how well they measure the generalization capacity of a generative model. In this paper, we investigate how well these metrics measure generalization. We introduce a framework for comparing the robustness of evaluation metrics. We show that better scores on these metrics do not imply better generalization: they can easily be fooled by a generator that memorizes a small subset of the training set. We propose a fix to the NND metric that makes it more robust to noise in the generated data. Toward building a robust metric for generalization, we propose applying the Minimum Description Length principle to the problem of evaluating DGMs, and we develop an efficient method for estimating the complexity of Generative Latent Variable Models (GLVMs). Experimental results show that our metric can effectively detect training set memorization and distinguish GLVMs of different generalization capacities.
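To make the idea of a sample-based distance between the generated and target distributions concrete, here is a minimal sketch of the Fréchet distance that underlies Fréchet Inception Distance. It is not taken from the paper's repository. It is an illustration under stated assumptions: raw 8-dimensional Gaussian vectors stand in for Inception features, and the names `frechet_distance`, `train`, `fresh`, and `memorized` are hypothetical. The score depends only on the means and covariances of the two sample sets, so by construction it cannot tell whether generated samples are novel or replayed from stored training points.

```python
import numpy as np


def frechet_distance(x, y):
    """Squared Fréchet distance between Gaussians fitted to the rows of x and y.

    This is the quantity FID reports, computed here on arbitrary feature
    vectors instead of Inception activations (an assumption of this sketch).
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # tr((cov_x cov_y)^{1/2}) is the sum of the square roots of the eigenvalues
    # of cov_x @ cov_y; these are real and non-negative for PSD inputs, so we
    # drop numerical noise in the imaginary part and clip tiny negatives.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    d2 = np.sum((mu_x - mu_y) ** 2) + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt
    return float(max(d2, 0.0))


rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 8))   # stand-in for training-set features
fresh = rng.normal(size=(2000, 8))   # new draws from the same distribution

# Hypothetical "memorizing generator": it replays points from a stored
# subset of 200 training samples and never produces anything new.
subset = train[:200]
memorized = subset[rng.integers(0, len(subset), size=2000)]

print("fresh samples   :", frechet_distance(train, fresh))
print("memorized subset:", frechet_distance(train, memorized))
```

Comparing the two printed scores shows how close a pure replay of stored points can get to the score of genuinely new samples in this toy setting. The toy does not reproduce the paper's experiments, which use image data and trained generators.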
