Deep Learning (DL) has become a cornerstone of many everyday applications that we now rely on. However, ensuring that a DL model uses the underlying hardware efficiently takes considerable effort. Knowledge of a model's inference characteristics can help find the right match, so that the model is given enough resources but not more than it needs. We have developed a DL Inference Performance Predictive Model (DIPPM) that predicts the inference latency, energy, and memory usage of a given input DL model on the NVIDIA A100 GPU. We also devised an algorithm that suggests an appropriate A100 Multi-Instance GPU (MIG) profile from the output of DIPPM. We developed a methodology to convert DL models expressed in multiple frameworks into a generalized graph structure used by DIPPM, which means DIPPM can parse input DL models from various frameworks. DIPPM not only helps find suitable hardware configurations but also enables rapid design-space exploration of a model's inference performance. We constructed a graph multi-regression dataset consisting of 10,508 different DL models to train and evaluate DIPPM, reaching a Mean Absolute Percentage Error (MAPE) as low as 1.9%.
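The 1.9% figure uses the standard MAPE metric over the model's predictions versus measured values. A minimal sketch of how MAPE is computed, with illustrative latency values that are not taken from the paper:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

# Hypothetical example: measured vs. predicted inference latency (ms)
measured = [12.0, 48.5, 230.0]
predicted = [12.2, 47.9, 226.0]
print(f"MAPE: {mape(measured, predicted):.2f}%")
```

The same formula applies to each predicted quantity (latency, energy, memory usage); a lower MAPE means the predictor's relative error is smaller on average.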