Many online action prediction models observe complete frames in order to locate and attend to informative subregions of those frames, called glimpses, and recognize an ongoing action from both global and local information. However, in resource-constrained applications, an agent may not be able to observe the complete frame, yet it must still locate useful glimpses and predict an incomplete action from local information only. In this paper, we develop Glimpse Transformers (GliTr), which observe only narrow glimpses at all times and predict both the ongoing action and the next most informative glimpse location from the partial spatiotemporal information collected so far. In the absence of ground truth for the optimal glimpse locations for action recognition, we train GliTr using a novel spatiotemporal consistency objective: we require GliTr to attend to glimpses whose features are similar to those of the corresponding complete frames (i.e., spatial consistency) and to produce class logits at time t equivalent to those predicted using whole frames up to t (i.e., temporal consistency). Including our proposed consistency objective yields ~10% higher accuracy on the Something-Something-v2 (SSv2) dataset than the baseline cross-entropy objective. Overall, despite observing only ~33% of the total area per frame, GliTr achieves 53.02% and 93.91% accuracy on the SSv2 and Jester datasets, respectively.
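To make the two consistency terms concrete, the following is a minimal sketch, not the authors' implementation. It assumes that spatial consistency is measured as a mean squared error between glimpse features and complete-frame features, and that temporal consistency is measured as a KL divergence between the class distributions obtained from glimpses and from whole frames at each time step. The abstract specifies neither distance, so both choices, along with the function names and the weights `w_spatial` and `w_temporal`, are illustrative assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of class logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def spatial_consistency(glimpse_feats, frame_feats):
    """Assumed form: MSE between glimpse and complete-frame features.

    Both arguments are lists (one entry per time step) of feature vectors.
    """
    total, count = 0.0, 0
    for g, f in zip(glimpse_feats, frame_feats):
        for gi, fi in zip(g, f):
            total += (gi - fi) ** 2
            count += 1
    return total / count


def temporal_consistency(glimpse_logits, frame_logits):
    """Assumed form: mean over time of KL(frame-based || glimpse-based).

    Entry t of each list holds the class logits predicted from the
    glimpses (or whole frames) observed up to time t.
    """
    total = 0.0
    for s_logits, t_logits in zip(glimpse_logits, frame_logits):
        log_p = log_softmax(t_logits)  # target: whole frames up to t
        log_q = log_softmax(s_logits)  # prediction: glimpses up to t
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(glimpse_logits)


def consistency_loss(glimpse_feats, frame_feats, glimpse_logits, frame_logits,
                     w_spatial=1.0, w_temporal=1.0):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return (w_spatial * spatial_consistency(glimpse_feats, frame_feats)
            + w_temporal * temporal_consistency(glimpse_logits, frame_logits))
```

Under these assumptions, the loss is zero when the glimpse-based features and logits match the frame-based ones exactly, and it grows as they diverge. In practice such a term would presumably be combined with the cross-entropy objective that the abstract uses as its baseline.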