FuTH-Net: Fusing Temporal Relations and Holistic Features for Aerial Video Classification

Unmanned aerial vehicles (UAVs) are now widely used for data acquisition due to their low cost and high mobility. With the increasing volume of aerial videos, the demand for automatically parsing these videos is surging. To achieve this, current research mainly focuses on extracting a holistic feature with convolutions along both spatial and temporal dimensions. However, these methods are limited by small temporal receptive fields and cannot adequately capture long-term temporal dependencies, which are important for describing complicated dynamics. In this paper, we propose a novel deep neural network, termed FuTH-Net, to model not only holistic features but also temporal relations for aerial video classification. Furthermore, the holistic features are refined by the multi-scale temporal relations in a novel fusion module to yield more discriminative video representations. More specifically, FuTH-Net employs a two-pathway architecture: (1) a holistic representation pathway to learn a general feature of both frame appearances and short-term temporal variations, and (2) a temporal relation pathway to capture multi-scale temporal relations across arbitrary frames, providing long-term temporal dependencies. Afterwards, a novel fusion module is proposed to spatiotemporally integrate the two features learned from the two pathways. Our model is evaluated on two aerial video classification datasets, ERA and Drone-Action, and achieves state-of-the-art results. This demonstrates its effectiveness and good generalization capacity across different recognition tasks (event classification and human action recognition). To facilitate further research, we release the code at https://gitlab.lrz.de/ai4eo/reasoning/futh-net.
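
To make the two-pathway idea more concrete, the following is a minimal toy sketch in plain Python, not the released FuTH-Net code. It assumes that "multi-scale temporal relations across arbitrary frames" can be illustrated by sampling temporally ordered frame tuples of several sizes, and it replaces the learned fusion module with plain concatenation. All function names (`sample_frame_tuples`, `relation_feature`, `fuse`) and parameters are hypothetical.

```python
import itertools
import random


def sample_frame_tuples(num_frames, scales, per_scale, seed=0):
    """Return {scale: list of ordered frame-index tuples} (illustrative only)."""
    rng = random.Random(seed)
    relations = {}
    for k in scales:
        # combinations() yields indices in increasing order, so temporal
        # order is preserved while the chosen frames may be far apart.
        all_tuples = list(itertools.combinations(range(num_frames), k))
        picked = rng.sample(all_tuples, min(per_scale, len(all_tuples)))
        relations[k] = sorted(picked)
    return relations


def relation_feature(frame_feats, relations):
    """Toy relation feature: per scale, average the concatenated frame features."""
    dim = len(frame_feats[0])
    out = []
    for k, tuples in sorted(relations.items()):
        if not tuples:  # no tuple of this size exists for so few frames
            continue
        acc = [0.0] * (k * dim)
        for t in tuples:
            concat = [v for idx in t for v in frame_feats[idx]]
            acc = [a + c for a, c in zip(acc, concat)]
        out.extend(a / len(tuples) for a in acc)
    return out


def fuse(holistic, relation):
    """Assumed stand-in for fusion: concatenation instead of a learned module."""
    return list(holistic) + list(relation)


# Usage: 8 frames with 2-dimensional toy features per frame.
frame_feats = [[float(i), float(i) * 2] for i in range(8)]
relations = sample_frame_tuples(num_frames=8, scales=(2, 3), per_scale=3)
# Stand-in for the holistic pathway: the mean frame feature.
holistic = [sum(f[d] for f in frame_feats) / len(frame_feats) for d in range(2)]
video_repr = fuse(holistic, relation_feature(frame_feats, relations))
print(len(video_repr))  # 2 (holistic) + 2*2 (scale 2) + 3*2 (scale 3) = 12
```

In the actual model both pathways and the fusion module are learned networks; the sketch only shows how relations at several temporal scales can sit alongside a single holistic descriptor in one video representation.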
