Deep convolutional networks have recently achieved great success in video recognition, yet their practical deployment remains a challenge due to the large amount of computation required for robust recognition. Motivated by the effectiveness of quantization in boosting efficiency, in this paper we propose a dynamic network quantization framework that selects the optimal precision for each frame, conditioned on the input, for efficient video recognition. Specifically, given a video clip, we train a very lightweight network in parallel with the recognition network to produce a dynamic policy indicating which numerical precision to use for each frame when recognizing videos. We train both networks effectively using standard backpropagation, with a loss that targets both the competitive performance and the resource efficiency required for video recognition. Extensive experiments on four challenging, diverse benchmark datasets demonstrate that our proposed approach yields significant savings in computation and memory usage while outperforming existing state-of-the-art methods.
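To make the two-network setup concrete, below is a minimal PyTorch sketch of a per-frame precision policy trained jointly with a recognition loss. The `PolicyNet` architecture, the bit-width options in `PRECISIONS`, the relative cost weights, and the use of a Gumbel-Softmax relaxation to keep the discrete per-frame choice differentiable are all illustrative assumptions, not the paper's exact implementation; the actual low-precision execution of the recognition backbone is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical bit-width options the policy can pick for each frame.
PRECISIONS = [32, 8, 4]

class PolicyNet(nn.Module):
    """Lightweight per-frame policy: maps a frame feature to a precision choice."""
    def __init__(self, feat_dim=128, num_choices=len(PRECISIONS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, num_choices),
        )

    def forward(self, frame_feats, tau=1.0):
        # frame_feats: (batch, num_frames, feat_dim)
        logits = self.net(frame_feats)
        # Gumbel-Softmax yields a differentiable (straight-through) one-hot
        # precision choice per frame, enabling standard backpropagation.
        return F.gumbel_softmax(logits, tau=tau, hard=True)  # (B, T, num_choices)

def training_loss(clip_logits, labels, policy, cost_per_choice, lam=0.1):
    """Cross-entropy for recognition plus a penalty on expected compute cost."""
    ce = F.cross_entropy(clip_logits, labels)
    # Expected cost: weight each frame's chosen precision by its relative cost.
    cost = (policy * cost_per_choice).sum(dim=-1).mean()
    return ce + lam * cost

# Usage sketch: frame features would come from a cheap feature extractor, and
# clip_logits from the recognition backbone run at the chosen precisions.
policy_net = PolicyNet()
feats = torch.randn(2, 16, 128)           # 2 clips, 16 frames each
choices = policy_net(feats)               # one-hot precision per frame
costs = torch.tensor([1.0, 0.25, 0.125])  # assumed relative cost per bit-width
loss = training_loss(torch.randn(2, 10), torch.randint(0, 10, (2,)),
                     choices, costs)
loss.backward()
```

The `lam` coefficient in this sketch is one plausible way to trade recognition accuracy against compute cost in a single loss, consistent with the joint performance-and-efficiency objective described above.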