With the growing workload of inference tasks on mobile devices, state-of-the-art neural architectures (NAs) are typically designed through Neural Architecture Search (NAS) to identify NAs with good tradeoffs between accuracy and efficiency (e.g., latency). Since measuring the latency of a huge set of candidate architectures during NAS is not scalable, approaches are needed for predicting the end-to-end inference latency on mobile devices. Such predictions are challenging due to hardware heterogeneity, the optimizations applied by ML frameworks, and the diversity of neural architectures. Motivated by these challenges, in this paper, we first quantitatively assess the characteristics of neural architectures and mobile devices that have significant effects on inference latency. Based on this assessment, we propose a latency prediction framework that addresses these challenges by building operation-wise latency predictors. Our comprehensive evaluations, covering a variety of settings and a number of hardware devices with multi-core CPUs and GPUs, show that the framework achieves high accuracy in end-to-end latency prediction. To demonstrate that our approach does not require expensive data collection, we also show that accurate predictions can be achieved for real-world NAs using only small amounts of profiling data.