Egocentric action anticipation is the task of predicting the future actions a camera wearer will likely perform based on past video observations. While in a real-world system it is fundamental to output such predictions before the action begins, past works have generally not paid attention to model runtime during evaluation. Indeed, current evaluation schemes assume that predictions can be made offline, and hence that computational resources are not limited. In contrast, in this paper, we propose a "streaming" egocentric action anticipation evaluation protocol which explicitly considers model runtime for performance assessment, assuming that predictions will be available only after the current video segment is processed, which depends on the processing time of a method. Following the proposed evaluation scheme, we benchmark different state-of-the-art approaches for egocentric action anticipation on two popular datasets. Our analysis shows that models with a smaller runtime tend to outperform heavier models in the considered streaming scenario, thus changing the rankings generally observed in standard offline evaluations. Based on this observation, we propose a lightweight action anticipation model consisting of a simple feed-forward 3D CNN, which we optimize using knowledge distillation techniques and a custom loss. The results show that the proposed approach outperforms prior art in the streaming scenario, also in combination with other lightweight models.
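The core idea of the streaming protocol can be illustrated with a minimal sketch: a prediction is counted only if the model finishes processing before the action actually starts, so a model's runtime directly affects its score. The function names, the fixed anticipation gap, and the toy samples below are illustrative assumptions, not the paper's actual evaluation code.

```python
# Minimal sketch of a streaming anticipation metric: a prediction is
# usable only if the model's runtime fits within the anticipation gap
# (i.e. the prediction is ready before the action begins).
# All names and values here are illustrative assumptions.

def streaming_correct(anticipation_time, model_runtime, prediction, label):
    """Return True only if the prediction is both timely and correct."""
    if model_runtime >= anticipation_time:
        return False  # prediction arrives too late to be actionable
    return prediction == label

def streaming_accuracy(samples, anticipation_time=1.0):
    """samples: list of (model_runtime_seconds, predicted_label, true_label)."""
    hits = sum(
        streaming_correct(anticipation_time, rt, pred, lab)
        for rt, pred, lab in samples
    )
    return hits / len(samples)

# A fast model with lower offline accuracy can win in the streaming setting:
fast = [(0.2, "pour", "pour"), (0.2, "cut", "stir"), (0.2, "open", "open")]
slow = [(1.5, "pour", "pour"), (1.5, "stir", "stir"), (1.5, "open", "open")]
print(streaming_accuracy(fast))  # 2/3: two timely, correct predictions
print(streaming_accuracy(slow))  # 0.0: every prediction arrives too late
```

Under this scoring, the slower model's perfect offline predictions are worthless because none of them are available in time, which is exactly why runtime-aware rankings can differ from offline ones.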