There is often variation in the shape and size of input data used for deep learning. In many cases, such data can be represented using tensors with non-uniform shapes, or ragged tensors. Due to limited and non-portable support for efficient execution on ragged tensors, current deep learning frameworks generally use techniques such as padding and masking to make the data shapes uniform and then offload the computations to optimized kernels for dense tensor algebra. Such techniques can, however, lead to a lot of wasted computation and, therefore, a loss in performance. This paper presents CoRa, a tensor compiler that allows users to easily generate efficient code for ragged tensor operators targeting a wide range of CPUs and GPUs. Evaluating CoRa on a variety of operators on ragged tensors as well as on an encoder layer of the transformer model, we find that CoRa (i) performs competitively with hand-optimized implementations of the operators and the transformer encoder and (ii) achieves, over PyTorch, a 1.6X geomean speedup for the encoder on an Nvidia GPU and a 1.86X geomean speedup for the multi-head attention module used in transformers on an ARM CPU.