We present LoopStack, a domain-specific compiler stack for tensor operations, composed of a frontend, LoopTool, and an efficient optimizing code generator, LoopNest. This stack enables us to compile entire neural networks and generate code targeting the AVX2, AVX512, NEON, and NEONfp16 instruction sets, while incorporating optimizations often missing from other machine learning compiler backends. We evaluate our stack on a collection of full neural networks and commonly used network blocks, as well as on individual operators, and show that in both cases LoopStack generates machine code that matches, and frequently exceeds, the performance of state-of-the-art machine learning frameworks. We also show that, for a large collection of schedules, LoopNest's compilation is orders of magnitude faster than LLVM's, while resulting in equal or better run time performance. Additionally, LoopStack has a very small memory footprint: a binary size of 245KB and under 30K lines of effective code make it ideal for use on mobile and embedded devices.