Deep Neural Networks (DNNs) have become an essential component in many application domains, including web-based services. A variety of these services require high throughput and (close to) real-time features, for instance, to respond to users' requests or to process a stream of incoming data on time. However, the trend in DNN design is toward larger models with many layers and parameters to achieve more accurate results. Although these models are often pre-trained, the computational complexity of such large models can still be significant enough to hinder low inference latency. Implementing a caching mechanism is a typical systems engineering solution for speeding up service response time. However, traditional caching is often not suitable for DNN-based services. In this paper, we propose an end-to-end automated solution to improve the performance of DNN-based services in terms of their computational complexity and inference latency. Our caching method adopts the ideas of self-distillation of DNN models and early exits. The proposed solution is an automated online layer caching mechanism that allows early exiting of a large model during inference time if the cache model in one of the early exits is confident enough to produce the final prediction. One of the main contributions of this paper is that we have implemented the idea as online caching, meaning that the cache models do not need access to the training data and operate solely on the incoming data at run-time, making the approach suitable for applications using pre-trained models. Our experimental results on two downstream tasks (face and object classification) show that, on average, caching can reduce the computational complexity of those services by up to 58\% (in terms of FLOPs count) and improve their inference latency by up to 46\% with little to no reduction in accuracy.
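The sketch below is a minimal, hypothetical illustration (not the authors' implementation) of the confidence-gated early-exit idea the abstract describes: lightweight cache heads are attached after intermediate blocks of a pre-trained backbone, and inference returns early as soon as one head's softmax confidence passes a threshold; otherwise the full model runs. Class names, the `threshold` parameter, and the single-sample assumption are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitCachedModel(nn.Module):
    """Hypothetical sketch: a backbone split into blocks, with a small
    'cache' classifier (early exit) after each block."""

    def __init__(self, blocks, cache_heads, final_head, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)            # stages of the large pre-trained model
        self.cache_heads = nn.ModuleList(cache_heads)  # one lightweight classifier per early exit
        self.final_head = final_head                   # original classification head
        self.threshold = threshold                     # confidence required for an early exit

    @torch.no_grad()
    def forward(self, x):
        # Assumes batch size 1 for the .item() confidence check; a batched
        # version would exit per-sample instead.
        for block, head in zip(self.blocks, self.cache_heads):
            x = block(x)
            probs = F.softmax(head(x.flatten(1)), dim=-1)
            conf, pred = probs.max(dim=-1)
            if conf.item() >= self.threshold:
                return pred                            # "cache hit": skip the remaining layers
        return self.final_head(x.flatten(1)).argmax(dim=-1)  # "cache miss": full forward pass
```

In this framing, raising `threshold` trades latency savings for accuracy: a higher value forces more samples through the full model, while a lower value lets more samples exit at the shallow cache heads.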