Despite existing work in machine learning inference serving, ease of use and cost efficiency remain challenges at large scale. Developers must manually search through thousands of model-variants -- versions of already-trained models that differ in hardware, resource footprints, latencies, costs, and accuracies -- to meet the diverse requirements of their applications. Since requirements, query load, and the applications themselves evolve over time, these decisions must be made dynamically for each inference query to avoid the excessive costs of naive autoscaling. To avoid navigating this large and complex trade-off space, developers often fix a single variant across queries and replicate it when load increases. However, given the diversity of variants and hardware platforms in the cloud, a lack of understanding of the trade-off space can incur significant costs to developers. This paper introduces INFaaS, a managed and model-less system for distributed inference serving, where developers simply specify the performance and accuracy requirements of their applications without naming a particular model-variant for each query. INFaaS generates model-variants and efficiently navigates their large trade-off space on behalf of developers to meet application-specific objectives: (a) for each query, it selects a model, a hardware architecture, and model optimizations; (b) it combines VM-level horizontal autoscaling with model-level autoscaling, in which multiple, different model-variants serve queries within each machine. By leveraging diverse variants and sharing hardware resources across models, INFaaS achieves 1.3x higher throughput, violates latency objectives 1.6x less often, and saves up to 21.6x in cost (8.5x on average) compared to state-of-the-art inference serving systems on AWS EC2.