The proliferation of deep learning frameworks and hardware platforms has created demand for an efficient compiler that can hide the diversity of both software and hardware and thereby provide application portability. Among existing deep learning compilers, TVM is well known for its efficient code generation and optimization across diverse hardware devices. Meanwhile, the Sunway many-core processor is a competitive candidate for both scientific computing and deep learning workloads because of its attractive computational power. This paper combines the trends in these two directions. Specifically, we propose swTVM, which extends the original TVM to support ahead-of-time (AOT) compilation for architectures that require cross-compilation, such as Sunway. In addition, we leverage architectural features during compilation, such as the core group for massive parallelism, DMA for high-bandwidth memory transfer, and the local device memory (LDM) for data locality, to generate efficient code for deep learning workloads on Sunway. Experimental results show that the code generated by swTVM achieves a 1.79x speedup on average over the state-of-the-art deep learning framework on Sunway across six representative benchmarks. This work is the first attempt from the compiler perspective to bridge the gap between deep learning and the Sunway processor with both productivity and efficiency in mind. We believe this work will encourage more people to embrace the combined power of deep learning and the Sunway many-core processor.
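To make the architectural features mentioned above concrete, the sketch below illustrates the kind of slave-core (CPE) kernel an AOT compiler could emit for Sunway: a core group of 64 CPEs splits the work, each CPE stages tiles through its LDM via DMA, computes locally, and writes results back. This is a minimal, hypothetical vector-add example using the standard athread interface; the function name vadd_slave, the tile size, and the buffer names are illustrative assumptions, not swTVM's actual generated output.

```c
/* Hypothetical sketch of a Sunway CPE kernel in the style an AOT compiler
 * might emit; tile size and names are illustrative, not swTVM's output.
 * Built with the slave-side compiler (e.g. sw5cc -slave). */
#include <slave.h>

#define TILE 256  /* elements per DMA transfer; buffers must fit in 64KB LDM */

/* __thread_local places these buffers in the CPE's local device memory. */
__thread_local float a_ldm[TILE], b_ldm[TILE], c_ldm[TILE];

typedef struct { float *a, *b, *c; int n; } vadd_arg_t;

void vadd_slave(vadd_arg_t *arg) {
    const int cpes = 64;               /* one core group has 64 CPEs */
    int per_cpe = arg->n / cpes;       /* assume n divides evenly, for brevity */
    int base = _MYID * per_cpe;        /* _MYID: this CPE's id within the group */
    volatile int reply;

    for (int off = 0; off < per_cpe; off += TILE) {
        int len = TILE * sizeof(float);

        /* DMA main memory -> LDM for both inputs, then wait for 2 replies. */
        reply = 0;
        athread_get(PE_MODE, arg->a + base + off, a_ldm, len, (void*)&reply, 0, 0, 0);
        athread_get(PE_MODE, arg->b + base + off, b_ldm, len, (void*)&reply, 0, 0, 0);
        while (reply != 2);

        /* Compute entirely out of LDM for data locality. */
        for (int i = 0; i < TILE; i++)
            c_ldm[i] = a_ldm[i] + b_ldm[i];

        /* DMA LDM -> main memory for the result tile. */
        reply = 0;
        athread_put(PE_MODE, c_ldm, arg->c + base + off, len, (void*)&reply, 0, 0);
        while (reply != 1);
    }
}
```

On the management processing element (MPE) side, the generated host code would typically initialize the athread runtime, spawn this kernel across the core group, and join, roughly via athread_init(), athread_spawn(...), and athread_join(); the exact launch boilerplate depends on the toolchain version.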