Video activity recognition by deep neural networks is impressive for many classes, but it still falls short of human performance, especially for activities that are hard to discriminate. Humans differentiate such complex activities by recognising critical spatio-temporal relations among explicitly recognised objects and parts, for example, an object entering the aperture of a container. Deep neural networks can struggle to learn such critical relationships effectively. We therefore propose a more human-like approach to activity recognition that interprets a video as a sequence of temporal phases and extracts specific relationships among objects and hands within those phases. Random forest classifiers are then learnt from these extracted relationships. We apply the method to a challenging subset of the Something-Something dataset and achieve more robust performance than neural network baselines on hard-to-discriminate activities.
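To make the pipeline concrete, here is a minimal sketch of the idea, under assumed interfaces: per-phase bounding boxes for the hand, object, and container are taken as given (in practice they would come from an object detector), and the helpers `iou`, `inside`, `phase_relation_features`, and `video_features` are hypothetical names introduced for illustration, not the authors' API.

```python
# Hedged sketch: qualitative spatial relations per temporal phase,
# concatenated into a fixed-length vector and fed to a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def inside(a, b):
    """True if box a lies entirely within box b (e.g. object in container aperture)."""
    return a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2] and a[3] <= b[3]

def phase_relation_features(hand, obj, container):
    """Qualitative relations among hand, object, and container in one phase."""
    return [
        iou(hand, obj),                 # hand touching/holding object
        iou(obj, container),            # object overlapping container
        float(inside(obj, container)),  # object inside container aperture
        float(obj[1] < container[1]),   # object above container
    ]

def video_features(phases):
    """Concatenate relation features over the video's sequential phases."""
    return np.concatenate([phase_relation_features(*p) for p in phases])

# Toy data: two videos, each split into two phases of (hand, obj, container) boxes.
put_in = [((0, 0, 2, 2), (1, 0, 2, 1), (4, 4, 8, 8)),
          ((4, 3, 6, 5), (5, 5, 7, 7), (4, 4, 8, 8))]   # object ends up inside
take_out = [((5, 5, 7, 7), (5, 5, 7, 7), (4, 4, 8, 8)),
            ((0, 0, 2, 2), (1, 0, 2, 1), (4, 4, 8, 8))]  # object ends up outside

X = np.stack([video_features(put_in), video_features(take_out)])
y = np.array([0, 1])  # 0 = "putting something into something", 1 = "taking out"

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X))
```

The design intent, as described in the abstract, is that the relational features are explicit and phase-aligned, so the classifier need only learn which relations discriminate between activities rather than discovering them from raw pixels.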