Online inference is becoming a key service product for many businesses, deployed on cloud platforms to meet customer demands. Despite their revenue-generating capability, these services must operate under tight Quality-of-Service (QoS) and cost budget constraints. This paper introduces KAIROS, a novel runtime framework that maximizes query throughput while meeting a QoS target and a cost budget. KAIROS designs and implements novel techniques to build a pool of heterogeneous compute hardware without online exploration overhead, and to distribute inference queries optimally at runtime. Our evaluation using industry-grade deep learning (DL) models shows that KAIROS yields up to 2X the throughput of an optimal homogeneous solution, and outperforms state-of-the-art schemes by up to 70\%, even when the competing schemes are given the advantage of having their exploration overhead ignored.