Deep generative models (DGMs), such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Diffusion Models, and Transformers, have shown great promise in a variety of applications, including image and speech synthesis, natural language processing, and drug discovery. However, when applied to engineering design problems, evaluating the performance of these models can be challenging, as traditional statistical metrics based on likelihood may not fully capture the requirements of engineering applications. This paper doubles as a review and a practical guide to evaluation metrics for DGMs in engineering design. We first summarize well-accepted "classic" evaluation metrics for DGMs grounded in machine learning theory and typical computer science applications. Using case studies, we then highlight why these metrics seldom translate well to design problems but see frequent use due to the lack of established alternatives. Next, we curate a set of design-specific metrics which have been proposed across different research communities and can be used for evaluating DGMs. These metrics focus on unique requirements in design and engineering, such as constraint satisfaction, functional performance, novelty, and conditioning. We structure our review and discussion as a set of practical selection criteria and usage guidelines. Throughout our discussion, we apply the metrics to models trained on simple two-dimensional example problems. Finally, to illustrate the selection process and typical usage of the presented metrics, we evaluate three DGMs on a multifaceted bicycle frame design problem considering performance target achievement, design novelty, and geometric constraints. We publicly release the code for the datasets, models, and metrics used throughout the paper at decode.mit.edu/projects/metrics/.