Efficient deployment of deep learning models for aerial object detection on resource-constrained devices requires significant compression without compromising performance. In this study, we propose a novel three-stage compression pipeline for the YOLOv8 object detection model, integrating sparsity-aware training, structured channel pruning, and Channel-Wise Knowledge Distillation (CWD). First, sparsity-aware training introduces dynamic sparsity during model optimization, effectively balancing parameter reduction and detection accuracy. Second, we apply structured channel pruning by leveraging batch normalization scaling factors to eliminate redundant channels, significantly reducing model size and computational complexity. Finally, to mitigate the accuracy drop caused by pruning, we employ CWD to transfer knowledge from the original model, using an adjustable temperature and loss weighting scheme tailored for small and medium object detection. Extensive experiments on the VisDrone dataset demonstrate the effectiveness of our approach across multiple YOLOv8 variants. For YOLOv8m, our method reduces model parameters from 25.85M to 6.85M (a 73.51% reduction), FLOPs from 49.6G to 13.3G, and MACs from 101G to 34.5G, while reducing AP50 by only 2.7%. The resulting compressed model achieves 47.9 AP50 and boosts inference speed from 26 FPS (YOLOv8m baseline) to 45 FPS, enabling real-time deployment on edge devices. We further apply TensorRT as a lightweight optimization step. While this introduces a minor drop in AP50 (from 47.9 to 47.6), it significantly improves inference speed from 45 to 68 FPS, demonstrating the practicality of our approach for high-throughput, resource-constrained scenarios.
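To make the second stage concrete, the sketch below illustrates the general idea of channel selection from batch normalization scaling factors (gamma): channels whose scaling factors fall below a global threshold are marked as redundant. This is a minimal illustration of the standard technique, not the paper's implementation; the layer names, gamma values, and 0.5 prune ratio are hypothetical.

```python
def select_prune_channels(bn_gammas, prune_ratio=0.5):
    """Pick channels to prune from BN scaling factors.

    bn_gammas: dict mapping layer name -> list of BN gamma values.
    Returns a dict mapping layer name -> indices of channels to remove.
    """
    # Pool all scaling factors to derive one global magnitude threshold.
    all_gammas = sorted(abs(g) for gs in bn_gammas.values() for g in gs)
    cut = int(len(all_gammas) * prune_ratio)
    threshold = all_gammas[min(cut, len(all_gammas) - 1)]

    pruned = {}
    for layer, gammas in bn_gammas.items():
        # Channels with small |gamma| contribute little and are candidates.
        idx = [i for i, g in enumerate(gammas) if abs(g) < threshold]
        # Keep at least one channel per layer to preserve connectivity.
        if len(idx) == len(gammas):
            idx = idx[:-1]
        pruned[layer] = idx
    return pruned

# Hypothetical example: two BN layers with illustrative gamma values.
gammas = {"conv1.bn": [0.9, 0.02, 0.5, 0.01], "conv2.bn": [0.03, 0.8]}
print(select_prune_channels(gammas, prune_ratio=0.5))
# → {'conv1.bn': [1, 3], 'conv2.bn': [0]}
```

In practice the selected channels are removed structurally (the corresponding convolution filters and downstream input channels are deleted), which is what yields the parameter, FLOP, and MAC reductions reported above.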