The increased interest in Artificial Intelligence (AI) has raised the need for highly optimized and sophisticated AI frameworks. Starting with the Lua-based Torch, many frameworks have emerged over time, such as Theano, Caffe, Chainer, CNTK, MXNet, PyTorch, DL4J, and TensorFlow. All of these provide a high-level scripting API that allows users to easily design neural networks and run them on various kinds of hardware. What the user usually does not see is the high effort put into these frameworks to provide peak execution performance. While mainstream CPUs and GPUs enjoy the "luxury" of a widespread user base in the open-source community, vendors of less mainstream CPUs, GPUs, or accelerators need to invest considerable effort to get their hardware supported by these frameworks. This includes not only the development of highly efficient compute libraries such as cuDNN, oneDNN, or VEDNN, but also support for an ever-growing number of simpler compute operations such as summations and multiplications. Nowadays, each of these frameworks supports several hundred unique operations, on tensors of various sizes, shapes, and data types, which results in thousands of compute kernels required for each device type. And the number of operations keeps increasing. That is why NEC Laboratories Europe started developing the SOL AI Optimization project years ago: to deliver optimal performance to users while keeping the maintenance burden minimal.
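The combinatorial growth described above can be made concrete with a back-of-the-envelope sketch. All counts here are illustrative assumptions, not figures from any specific framework:

```python
# Hypothetical counts showing why per-device kernel counts explode.
ops = 300          # distinct operations a framework supports (assumption)
dtypes = 4         # e.g. float32, float64, float16, int32 (assumption)
layouts = 2        # e.g. contiguous vs. strided specializations (assumption)

# Each (operation, dtype, layout) combination typically needs its own
# hand-tuned kernel, and the whole set must be re-implemented per device type.
kernels_per_device = ops * dtypes * layouts
print(kernels_per_device)  # 2400 kernels for a single device type
```

Even with these modest assumptions, a vendor faces thousands of kernels per device type, which is the maintenance burden SOL aims to reduce.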