Python has become the de facto language for training deep neural networks, coupling a large suite of scientific computing libraries with efficient tensor-computation frameworks such as PyTorch and TensorFlow. However, when models are used for inference, they are typically extracted from Python as TensorFlow graphs or TorchScript programs in order to meet performance and packaging constraints. The extraction process can be time-consuming, impeding fast prototyping. We show how it is possible to meet these performance and packaging constraints while performing inference in Python. In particular, we present a way of using multiple Python interpreters within a single process to achieve scalable inference and describe a new container format for models that contains both native Python code and data. This approach simplifies the model deployment story by eliminating the model extraction step and makes it easier to integrate existing performance-enhancing Python libraries. We evaluate our design on a suite of popular PyTorch models from GitHub, showing how they can be packaged in our inference format, and compare their performance to TorchScript. For larger models, our packaged Python models perform the same as TorchScript; for smaller models, where Python overhead is more noticeable, our multi-interpreter approach ensures that inference remains scalable.
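For concreteness, the packaging half of this workflow can be sketched with PyTorch's `torch.package` API, which bundles a model's Python code and weights into a single archive and loads them back without a separate extraction step. This is a minimal sketch under the assumption that the container format described above behaves like `torch.package`; the file names and the toy `Linear` model are illustrative, and the multi-interpreter runtime is not shown here.

```python
import torch
from torch.package import PackageExporter, PackageImporter

# Illustrative model; any picklable nn.Module could be packaged this way.
model = torch.nn.Linear(4, 2)

# Package the model's code and data into one self-contained archive.
with PackageExporter("linear_model.pt") as exporter:
    # Resolve torch itself from the loading environment rather than
    # bundling it into the archive.
    exporter.extern(["torch", "torch.**"])
    exporter.save_pickle("model", "model.pkl", model)

# Later, in an inference process: load code + weights from the archive.
importer = PackageImporter("linear_model.pt")
loaded = importer.load_pickle("model", "model.pkl")

with torch.no_grad():
    out = loaded(torch.randn(1, 4))
print(out.shape)  # torch.Size([1, 2])
```

The key property this illustrates is that the archive carries the model's Python source alongside its serialized tensors, so the same artifact can be loaded directly by the Python interpreters embedded in the serving process.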