Text-to-image generation and image captioning have recently emerged as a new experimental paradigm for assessing machine intelligence. These models predict continuous quantities and rely on sampling techniques during generation, which complicates evaluation and makes the marginal distributions intractable to obtain. Following the recent trend of evaluating multimodal generative models with a vision-and-language pre-trained model, we propose the negative Gaussian cross-mutual information computed on CLIP features as a unified metric, coined Mutual Information Divergence (MID). To validate it, we extensively compare it with competing metrics using carefully generated or human-annotated judgments on text-to-image generation and image captioning tasks. The proposed MID significantly outperforms the competing methods in consistency across benchmarks, sample parsimony, and robustness to the choice of the underlying CLIP model. We look forward to seeing the underexplored implications of the Gaussian cross-mutual information in multimodal representation learning and future work based on this novel proposition.
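For intuition, the quantity MID builds on has a closed form under a Gaussian model: for jointly Gaussian X and Y, I(X;Y) = (1/2) log(det Σ_X det Σ_Y / det Σ), where Σ is the joint covariance. The following is a minimal sketch, not the authors' released implementation, that estimates this mutual information from paired CLIP image and text embeddings under the joint-Gaussian assumption; the array names and the ridge term `eps` are illustrative assumptions, and the full metric is a negative *cross*-mutual information between reference and candidate statistics as defined in the paper.

```python
# Sketch: Gaussian mutual information of paired CLIP features.
# Assumes img_feats and txt_feats are (N, D) arrays of paired embeddings;
# these names are hypothetical, not from the authors' code.
import numpy as np

def gaussian_mi(img_feats: np.ndarray, txt_feats: np.ndarray, eps: float = 1e-6) -> float:
    """I(X;Y) = 0.5 * log(det(Sx) * det(Sy) / det(S)) under a joint Gaussian model."""
    x = img_feats - img_feats.mean(axis=0)
    y = txt_feats - txt_feats.mean(axis=0)
    n, d = x.shape
    joint = np.concatenate([x, y], axis=1)      # (N, 2D) stacked features
    cov = joint.T @ joint / (n - 1)             # joint covariance estimate
    cov += eps * np.eye(2 * d)                  # ridge term for numerical stability
    _, logdet_joint = np.linalg.slogdet(cov)
    _, logdet_x = np.linalg.slogdet(cov[:d, :d])
    _, logdet_y = np.linalg.slogdet(cov[d:, d:])
    return 0.5 * (logdet_x + logdet_y - logdet_joint)

# Usage with random features standing in for CLIP embeddings:
rng = np.random.default_rng(0)
img = rng.standard_normal((1000, 64))
txt = 0.5 * img + 0.5 * rng.standard_normal((1000, 64))  # correlated pair
print(gaussian_mi(img, txt))  # positive for dependent features, near 0 if independent
```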