In modern machine learning (ML) applications on cloud or edge computing platforms, batching is an important technique for providing efficient and economical services at scale. In particular, parallel computing resources on these platforms, such as graphics processing units (GPUs), achieve higher computational and energy efficiency with larger batch sizes. However, larger batch sizes may also lead to longer response times, so a judicious design is required. This paper aims to provide a dynamic batching policy that strikes a balance between efficiency and latency. The GPU-based inference service is modeled as a batch service queue with batch-size-dependent processing times. The design of dynamic batching then becomes a continuous-time average-cost problem, formulated as a semi-Markov decision process (SMDP) with the objective of minimizing the weighted sum of average response time and average power consumption. The optimal policy is obtained by solving an associated discrete-time Markov decision process (MDP) problem with finite state approximation and "discretization". By introducing an abstract cost to reflect the impact of "tail" states, the space complexity and the time complexity of the solution procedure are reduced by 63.5% and 98%, respectively. Our results show that the optimal policies potentially possess a control limit structure. Numerical results also show that SMDP-based batching policies can adapt to different traffic intensities and outperform other benchmark policies. Furthermore, the proposed solution offers notable flexibility in balancing power consumption and latency.
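For intuition, the control limit structure mentioned above can be sketched as a simple threshold rule. The following is a minimal sketch, not the paper's implementation; the names control_limit_batch, threshold, and max_batch are hypothetical placeholders rather than quantities defined in the paper.

```python
# A minimal sketch of a control-limit dynamic batching rule, assuming a
# hypothetical queue-length threshold and a maximum batch size supported
# by the GPU. Jobs wait in a queue; once the queue length reaches the
# threshold, all waiting jobs (up to the maximum batch size) are
# dispatched as one batch; otherwise the server keeps waiting.

def control_limit_batch(queue_length: int, threshold: int, max_batch: int) -> int:
    """Return the batch size to dispatch now (0 means keep waiting)."""
    if queue_length >= threshold:
        return min(queue_length, max_batch)
    return 0

# Example: with a threshold of 4 and a maximum batch size of 8,
# a queue of 6 waiting jobs is dispatched as a single batch of 6.
print(control_limit_batch(6, threshold=4, max_batch=8))  # -> 6
```

A smaller threshold favors latency, while a larger threshold favors batching efficiency and hence power; the SMDP formulation in the paper selects such decisions to optimize the weighted average-cost objective rather than fixing the threshold by hand.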