Recognizing the motion of Micro Aerial Vehicles (MAVs) is crucial for enabling cooperative perception and control in autonomous aerial swarms. Yet, vision-based recognition models relying only on RGB data often fail to capture the complex spatiotemporal characteristics of MAV motion, which limits their ability to distinguish different actions. To overcome this problem, this paper presents MAVR-Net, a multi-view learning-based MAV action recognition framework. Unlike traditional single-view methods, the proposed approach combines three complementary types of data, namely raw RGB frames, optical flow, and segmentation masks, to improve the robustness and accuracy of MAV motion recognition. Specifically, ResNet-based encoders are used to extract discriminative features from each view, and a multi-scale feature pyramid is adopted to preserve the spatiotemporal details of MAV motion patterns. To enhance the interaction between different views, a cross-view attention module is introduced to model the dependencies among modalities and feature scales. In addition, a multi-view alignment loss is designed to enforce semantic consistency and strengthen cross-view feature representations. Experimental results on benchmark MAV action datasets show that our method clearly outperforms existing approaches, achieving 97.8\%, 96.5\%, and 92.8\% accuracy on the Short MAV, Medium MAV, and Long MAV datasets, respectively.
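To make the two cross-view components concrete, the following is a minimal NumPy sketch of (a) scaled dot-product cross-view attention, where tokens from one view (e.g., RGB) attend to another (e.g., optical flow), and (b) a cosine-similarity alignment loss between paired view embeddings. All names, dimensions, and the exact loss form are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(query_feats, context_feats):
    """Scaled dot-product attention: query-view tokens attend to context-view tokens.

    query_feats:   (N, d) features from one view (e.g., RGB)
    context_feats: (M, d) features from another view (e.g., optical flow)
    Returns (N, d) query features enriched with context-view information.
    """
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)  # (N, M) affinities
    return softmax(scores, axis=-1) @ context_feats

def alignment_loss(view_a, view_b):
    """Illustrative alignment loss: mean (1 - cosine similarity) of paired embeddings."""
    a = view_a / np.linalg.norm(view_a, axis=-1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * b, axis=-1)))

# Toy example: 4 tokens of dimension 8 per view for RGB and flow.
rng = np.random.default_rng(0)
rgb = rng.standard_normal((4, 8))
flow = rng.standard_normal((4, 8))

fused = cross_view_attention(rgb, flow)  # RGB tokens enriched with flow context
loss = alignment_loss(rgb, rgb)          # identical views align perfectly -> 0.0
```

In a full model, such attention would be applied pairwise across all three views at each pyramid scale, and the alignment loss would be summed over view pairs alongside the classification objective.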