Binarized Neural Networks (BNNs) significantly reduce computation and memory demands compared to full-precision NNs by binarizing weights and activations. On a heterogeneous multiprocessor platform consisting of a CPU and a GPU, the device on which each BNN layer executes can affect inference performance, i.e., accuracy and latency. Such a heterogeneous HW platform is typically available to execute BNN workloads; however, using it effectively requires an efficient strategy for mapping the BNN workload onto the devices. In this work, we propose a framework that generates efficient BNN layer-to-device mappings (i.e., a suitable parallel configuration for each layer of the model) for execution platforms composed of a CPU and a CUDA-capable GPU. We evaluate the proposed framework with two BNN architectures on two well-known datasets, Fashion-MNIST and CIFAR-10, using three hardware platforms with different characteristics. The results show that, compared to a fully-parallelized GPU implementation, the configurations generated by our framework are up to 2x, 2.6x, and 11.8x faster on the three tested hardware platforms, respectively.
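To make the notion of a layer-to-device mapping concrete, the sketch below shows one simple way such a mapping could be chosen: a small dynamic program that, given per-layer latencies measured on each device and a cost for moving activations between devices, assigns each layer to the device minimizing cumulative latency. This is a minimal illustration under assumed numbers, not the paper's actual framework; the latencies, the `transfer_ms` cost, and the `map_layers` helper are all hypothetical.

```python
# Minimal sketch of a layer-to-device mapper (illustrative only; the
# latencies and transfer cost are made-up assumptions, not measurements
# from the proposed framework).

# Hypothetical measured latency (ms) of each BNN layer on each device.
layer_latency = [
    {"cpu": 0.9, "gpu": 0.4},  # conv1
    {"cpu": 2.1, "gpu": 0.6},  # conv2
    {"cpu": 0.3, "gpu": 0.5},  # fc1 (small layer: CPU may win)
]
transfer_ms = 0.2  # assumed cost of moving activations CPU<->GPU


def map_layers(latency, transfer):
    """Pick a device per layer by dynamic programming, charging the
    transfer cost whenever consecutive layers run on different devices."""
    # best[d] = (cumulative latency ending on device d, device sequence)
    best = {d: (latency[0][d], [d]) for d in ("cpu", "gpu")}
    for lat in latency[1:]:
        nxt = {}
        for d in ("cpu", "gpu"):
            # Stay on the same device, or pay the transfer cost to switch.
            cost, path = min(
                (best[p][0] + (0 if p == d else transfer), best[p][1])
                for p in ("cpu", "gpu")
            )
            nxt[d] = (cost + lat[d], path + [d])
        best = nxt
    return min(best.values())


total, mapping = map_layers(layer_latency, transfer_ms)
print(f"mapping={mapping}, total latency={total:.2f} ms")
```

Even this toy version captures why a per-layer mapping can beat a fully-parallelized GPU implementation: small layers whose kernel-launch and transfer overheads dominate can be cheaper on the CPU, while large layers stay on the GPU.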