Compressing neural network architectures is important for deploying models to embedded or mobile devices, and pruning and quantization are currently the major approaches to neural network compression. Both methods benefit when compression parameters are selected specifically for each layer. Finding good combinations of compression parameters, so-called compression policies, is hard, as the problem spans an exponentially large search space. Effective compression policies account for the influence of the specific hardware architecture on the compression methods used. We propose Galen, an algorithmic framework that searches for such policies via reinforcement learning, combining pruning and quantization to compress neural networks automatically. In contrast to other approaches, we use the inference latency measured on the target hardware device as the optimization goal, which allows the framework to compress models specifically for a given hardware target. We validate our approach with three reinforcement learning agents: one for pruning, one for quantization, and one for joint pruning and quantization. Besides demonstrating the functionality of our approach, we compressed a ResNet18 for CIFAR-10 to 20% of its original inference latency on an embedded ARM processor without significant loss of accuracy. Moreover, we show that a joint search over pruning and quantization policies is superior to searching for policies with either compression method alone.
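To make the setting concrete, the sketch below illustrates what a per-layer compression policy and a latency-aware reward could look like. It is not the authors' implementation; all names (LayerAction, reward, beta, the latency arguments) are hypothetical, and the reward shape is one plausible choice for trading off retained accuracy against speed-up measured on the target device.

```python
# Minimal sketch (not Galen's actual implementation) of a per-layer
# compression policy and a hardware-aware reward. All identifiers here
# are hypothetical illustrations of the concepts in the abstract.

from dataclasses import dataclass
from typing import List


@dataclass
class LayerAction:
    """One step of the joint policy: compression parameters for a single layer."""
    pruning_ratio: float   # fraction of channels to remove, e.g. 0.0-0.9
    weight_bits: int       # quantization bit-width for weights, e.g. 2-8
    activation_bits: int   # quantization bit-width for activations


# A compression policy assigns one action to each layer of the network;
# the search space therefore grows exponentially with network depth.
Policy = List[LayerAction]


def reward(accuracy: float, latency_ms: float, baseline_latency_ms: float,
           beta: float = 1.0) -> float:
    """Hypothetical reward: retained accuracy weighted by the speed-up
    measured on the target hardware (latency is measured on the device,
    not estimated from proxies such as FLOPs or parameter counts)."""
    speedup = baseline_latency_ms / latency_ms
    return accuracy * speedup ** beta
```

A reinforcement learning agent would emit one `LayerAction` per layer, the compressed model would be evaluated on the target device, and the resulting reward would guide the next policy proposal.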