We present a learning algorithm for human activity recognition in videos. Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras and contain a human actor along with background motion. Typically, the human actor occupies less than one-tenth of the spatial resolution. Our approach simultaneously harnesses the benefits of frequency-domain representations, a classical analysis tool in signal processing, and data-driven neural networks. We build a differentiable static-dynamic frequency mask prior to model the salient static and dynamic pixels in the video, which are crucial for the underlying task of action recognition. We use this differentiable mask prior to enable the neural network to intrinsically learn disentangled feature representations via an identity loss function. Our formulation empowers the network to inherently compute disentangled salient features within its layers. Further, we propose a cost function encapsulating temporal relevance and spatial content to sample the most important frame within uniformly spaced video segments. We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset and demonstrate relative improvements of 5.72%-13.00% over the state of the art and 14.28%-38.05% over the corresponding baseline model.
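To make the static-dynamic frequency decomposition concrete, the sketch below separates a video clip into its static and dynamic components with a binary mask over temporal frequencies: low temporal frequencies (near-DC) capture static content, while the remaining frequencies capture motion. This is an illustrative NumPy sketch of the general idea only, not the paper's differentiable mask prior; the function name, the `cutoff` parameter, and the hard binary mask are assumptions for illustration (the paper's mask is learned and differentiable).

```python
import numpy as np

def static_dynamic_split(video, cutoff=1):
    """Illustrative hard-mask split of a clip into static/dynamic parts.

    video: array of shape (T, H, W), a grayscale clip.
    cutoff: number of low temporal frequencies treated as "static"
            (hypothetical parameter; the paper learns a soft mask instead).
    """
    # FFT along the temporal axis: each pixel becomes a spectrum over time.
    F = np.fft.fft(video, axis=0)

    # Binary mask over temporal frequencies: True = static (low-frequency).
    static_mask = np.zeros(video.shape[0], dtype=bool)
    static_mask[:cutoff] = True

    # Apply complementary masks and invert back to the pixel domain.
    static = np.fft.ifft(F * static_mask[:, None, None], axis=0).real
    dynamic = np.fft.ifft(F * ~static_mask[:, None, None], axis=0).real
    return static, dynamic
```

With `cutoff=1`, the static component is simply each pixel's temporal mean, and the two components sum back to the original clip, which is what makes such a decomposition a natural target for learning a soft, differentiable mask instead of a hard one.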