We present MEMA, a framework for quickly deriving efficient inference runtimes that minimize external memory accesses for matrix multiplication on TinyML systems. Given the hardware resource constraints and the problem sizes, the framework analytically determines optimized schedules and kernels that minimize memory accesses. MEMA addresses a well-known problem in current practice: optimal schedules tend to be found only through a time-consuming, heuristic search of a large scheduling space. We compare the performance of runtimes derived from MEMA against existing state-of-the-art libraries on ARM-based TinyML systems. For example, on neural network benchmarks on the ARM Cortex-M4, we achieve up to a 1.8x speedup and a 44% energy reduction over CMSIS-NN.
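To make the abstract's central idea concrete, the sketch below is a minimal illustration of the kind of tiled matrix-multiplication schedule such a framework reasons about; it is not MEMA's derived kernel. The tile sizes TM, TN, TK and the function name matmul_tiled are hypothetical placeholders: in MEMA's setting, the tile sizes would be chosen analytically from the hardware's register and SRAM budget and the problem dimensions so that each tile is fetched from external memory as few times as possible.

```c
/*
 * Minimal sketch of loop tiling for matrix multiplication (C = C + A * B).
 * NOT the MEMA-derived kernel; tile sizes TM/TN/TK are hypothetical and
 * assumed to divide M/N/K evenly. A memory-access-minimizing schedule
 * would pick them so the working set of a tile fits in fast memory,
 * reducing refetches of A and B from external memory.
 */
#include <stdio.h>

#define M 8
#define N 8
#define K 8
#define TM 4   /* hypothetical tile sizes, chosen to fit fast memory */
#define TN 4
#define TK 4

static void matmul_tiled(const float A[M][K], const float B[K][N],
                         float C[M][N])
{
    /* Outer loops walk tiles; inner loops compute within one tile.
     * Within a tile, operands stay in registers/SRAM, so each element
     * is loaded from external memory once per tile rather than once
     * per use. */
    for (int i0 = 0; i0 < M; i0 += TM)
        for (int j0 = 0; j0 < N; j0 += TN)
            for (int k0 = 0; k0 < K; k0 += TK)
                for (int i = i0; i < i0 + TM; i++)
                    for (int j = j0; j < j0 + TN; j++) {
                        float acc = C[i][j];
                        for (int k = k0; k < k0 + TK; k++)
                            acc += A[i][k] * B[k][j];
                        C[i][j] = acc;
                    }
}

int main(void)
{
    float A[M][K], B[K][N], C[M][N] = {{0}};
    for (int i = 0; i < M; i++)
        for (int k = 0; k < K; k++) A[i][k] = 1.0f;
    for (int k = 0; k < K; k++)
        for (int j = 0; j < N; j++) B[k][j] = 1.0f;
    matmul_tiled(A, B, C);
    printf("C[0][0] = %.1f\n", C[0][0]); /* expect K = 8.0 */
    return 0;
}
```

The design choice the schedule space exposes is exactly which loops to tile and with what sizes; enumerating these by hand is the heuristic search the abstract describes, whereas an analytical model can select the tiling directly from the memory hierarchy's capacities.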