Deep neural networks (DNNs) frequently contain far more weights, represented at a higher precision, than are required for the specific task which they are trained to perform. Consequently, they can often be compressed using techniques such as weight pruning and quantization that reduce both the model size and inference time without appreciable loss in accuracy. However, finding the best compression strategy and corresponding target sparsity for a given DNN, hardware platform, and optimization objective currently requires expensive, frequently manual, trial-and-error experimentation. In this paper, we introduce a programmable system for model compression called Condensa. Users programmatically compose simple operators, in Python, to build more complex and practically interesting compression strategies. Given a strategy and user-provided objective (such as minimization of running time), Condensa uses a novel Bayesian optimization-based algorithm to automatically infer desirable sparsities. Our experiments on four real-world DNNs demonstrate memory footprint and hardware runtime throughput improvements of 188x and 2.59x, respectively, using at most ten samples per search. We have released a reference implementation of Condensa at https://github.com/NVlabs/condensa.
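To make the idea of programmable operator composition concrete, the sketch below shows how simple compression operators (magnitude pruning and half-precision quantization) could be chained into a larger strategy. It is only an illustrative example under assumed names (`prune_magnitude`, `quantize_fp16`, `compose`), not Condensa's actual API; see the released implementation for the real interface.

```python
# Illustrative sketch of composable compression operators (hypothetical names,
# not Condensa's actual API).
import numpy as np

def prune_magnitude(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

def quantize_fp16(weights: np.ndarray) -> np.ndarray:
    """Reduce weight precision to half-precision floating point."""
    return weights.astype(np.float16)

def compose(*operators):
    """Chain simple operators into a more complex compression strategy."""
    def strategy(weights: np.ndarray) -> np.ndarray:
        for op in operators:
            weights = op(weights)
        return weights
    return strategy

# Example strategy: prune 95% of the weights, then quantize the rest to FP16.
strategy = compose(lambda w: prune_magnitude(w, sparsity=0.95), quantize_fp16)
compressed = strategy(np.random.randn(1024, 1024).astype(np.float32))
```

In the actual system, the sparsity passed to the pruning operator is not fixed by hand as above but is inferred automatically by the Bayesian optimization-based search described in the abstract.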