In this work, we introduce a series of architectural modifications that aim to boost neural networks' accuracy while retaining their GPU training and inference efficiency. We first demonstrate and discuss the bottlenecks induced by FLOPs-oriented optimizations. We then suggest alternative designs that better utilize GPU structure and assets. Finally, we introduce a new family of GPU-dedicated models, called TResNet, which achieves better accuracy and efficiency than previous ConvNets. Using a TResNet model with GPU throughput similar to that of ResNet50, we reach 80.7% top-1 accuracy on ImageNet. Our TResNet models also transfer well to competitive datasets and achieve state-of-the-art accuracy on Stanford Cars (96.0%), CIFAR-10 (99.0%), CIFAR-100 (91.5%), and Oxford-Flowers (99.1%). The implementation is available at: this



Many deep learning applications are intended to run on mobile devices, and for many of them both accuracy and inference time matter. While the number of FLOPs is commonly used as a proxy for neural network latency, it may not be the best choice. To obtain a better approximation of latency on mobile CPUs, the research community uses look-up tables that store the latency of every possible layer and sums the entries to predict the inference time of a full network; this requires only a small number of experiments. Unfortunately, on mobile GPUs this method is not applicable in a straightforward way and shows low precision. In this work, we treat latency approximation on mobile GPUs as a data- and hardware-specific problem. Our main goal is to construct a convenient Latency Estimation Tool for Investigation (LETI) of neural network inference and to build robust and accurate latency prediction models for each specific task. To achieve this goal, we build open-source tools that provide a convenient way to conduct massive experiments on different target devices, focusing on mobile GPUs. After collecting the dataset, we fit a regression model on the experimental data and use it for subsequent latency prediction and analysis. We experimentally demonstrate the applicability of this approach on a subset of the popular NAS-Benchmark 101 dataset, and also evaluate the most popular neural network architectures on two mobile GPUs. As a result, we construct a latency prediction model with good precision on the target evaluation subset. We consider LETI a useful tool for neural architecture search and massive latency evaluation. The project is available at
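The two estimation strategies contrasted above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the layer names, table entries, and features below are hypothetical, and the regression is plain least squares standing in for whatever model LETI actually fits.

```python
# Sketch of the two latency-estimation approaches described above.
# All layer names, timings, and features are illustrative, not measured data.
import numpy as np

# --- Approach 1: look-up table (works well on mobile CPU) ---
# Per-layer latencies in ms, measured once per target device.
layer_latency_ms = {
    "conv3x3_64": 1.8,
    "conv1x1_128": 0.6,
    "dwconv3x3_128": 0.9,
    "fc_1000": 0.4,
}

def lookup_latency(layers):
    """Estimate network latency as the sum of per-layer table entries."""
    return sum(layer_latency_ms[name] for name in layers)

# --- Approach 2: regression on measured data (for mobile GPU,
# where simple per-layer summation is imprecise) ---
def fit_latency_model(features, latencies):
    """Fit a least-squares linear model: latency ~ X @ w + b."""
    X = np.hstack([features, np.ones((len(features), 1))])  # bias column
    w, *_ = np.linalg.lstsq(X, latencies, rcond=None)
    return w

def predict_latency(w, feature_row):
    """Predict latency for one network described by its feature vector."""
    return float(np.dot(np.append(feature_row, 1.0), w))
```

In practice the feature vector would encode architecture properties (layer counts, FLOPs, tensor shapes) gathered by the benchmarking tool, and the model would be retrained per target device, since the abstract frames latency as hardware-specific.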