Modern neural networks are powerful predictive models, but they perform poorly at recognizing when their predictions may be wrong. For example, with the ReLU, one of the most common activation functions, and its variants, even a well-calibrated model can produce incorrect yet highly confident predictions. In the related task of action recognition, most current classification methods rely on clip-level classifiers that densely sample a given video into non-overlapping, same-sized clips and aggregate the results with an aggregation function, typically averaging, to obtain video-level predictions. While this approach has been shown to be effective, it is sub-optimal in recognition accuracy and carries a high computational overhead. To mitigate both issues, we propose a confidence distillation framework that teaches the student sampler a representation of the teacher's uncertainty and divides the task of full-video prediction between the student and teacher models. We conduct extensive experiments on three action recognition datasets and demonstrate that our framework achieves significant improvements in action recognition accuracy (up to 20%) and computational efficiency (more than 40%).
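The clip-level pipeline described above can be sketched as follows. This is a minimal illustration only; the function names and the choice to average logits are assumptions for exposition, not the paper's actual implementation:

```python
import numpy as np

def sample_clips(num_frames, clip_len):
    """Densely sample a video into non-overlapping, same-sized clips.

    Returns a list of frame-index lists, one per clip (a hypothetical
    helper, not from the paper).
    """
    return [list(range(start, start + clip_len))
            for start in range(0, num_frames - clip_len + 1, clip_len)]

def video_prediction(clip_logits):
    """Aggregate clip-level outputs into a video-level prediction.

    Averaging is the typical aggregation function mentioned in the text.
    """
    return np.mean(np.stack(clip_logits), axis=0)

# Toy example: a 64-frame video split into 16-frame clips, 3 classes.
clips = sample_clips(64, 16)  # 4 non-overlapping clips
rng = np.random.default_rng(0)
clip_logits = [rng.standard_normal(3) for _ in clips]  # stand-in classifier outputs
video_logits = video_prediction(clip_logits)
predicted_class = int(np.argmax(video_logits))
```

Note that every clip is classified regardless of how informative it is, which is the computational overhead the proposed student sampler is meant to reduce.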