Existing Real-Time Object Detection (RTOD) methods commonly adopt YOLO-like architectures for their favorable trade-off between accuracy and speed. However, these models rely on static dense computation that applies uniform processing to all inputs, misallocating representational capacity and computational resources: over-allocating on trivial scenes while under-serving complex ones. This mismatch results in both computational redundancy and suboptimal detection performance. To overcome this limitation, we propose YOLO-Master, a novel YOLO-like framework that introduces instance-conditional adaptive computation for RTOD. This is achieved through an Efficient Sparse Mixture-of-Experts (ES-MoE) block that dynamically allocates computational resources to each input according to its scene complexity. At its core, a lightweight dynamic routing network guides expert specialization during training through a diversity-enhancing objective, encouraging complementary expertise among experts. At inference, the routing network adaptively activates only the most relevant experts, improving detection performance while minimizing computational overhead. Comprehensive experiments on five large-scale benchmarks demonstrate the superiority of YOLO-Master. On MS COCO, our model achieves 42.4% AP with 1.62 ms latency, outperforming YOLOv13-N by +0.8% mAP while running 17.8% faster. Notably, the gains are most pronounced on challenging dense scenes, while the model preserves efficiency on typical inputs and maintains real-time inference speed. Code will be made available.
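The sparse routing mechanism described above can be illustrated with a minimal sketch: a lightweight router scores all experts from an input feature, only the top-k experts are executed, and a simple load-balancing surrogate stands in for the diversity-enhancing objective. All class and function names here (`SparseMoE`, `diversity_loss`, the linear router and experts) are illustrative assumptions, not the paper's actual ES-MoE implementation, whose details the abstract does not specify.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


class SparseMoE:
    """Toy sparse Mixture-of-Experts layer: a linear router scores each
    expert from an input feature vector, and only the top-k experts run."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one weight vector per expert (a hypothetical stand-in
        # for the paper's lightweight dynamic routing network).
        self.router = [[rng.gauss(0, 0.1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Experts: each a simple linear map dim -> dim.
        self.experts = [[[rng.gauss(0, 0.1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def forward(self, x):
        # 1. Route: score every expert, then keep only the top-k.
        logits = [sum(w * xi for w, xi in zip(wv, x)) for wv in self.router]
        probs = softmax(logits)
        chosen = sorted(range(len(probs)),
                        key=lambda i: probs[i], reverse=True)[:self.top_k]
        # 2. Renormalize gate weights over the chosen experts only.
        z = sum(probs[i] for i in chosen)
        out = [0.0] * len(x)
        for i in chosen:
            gate = probs[i] / z
            # 3. Run the selected expert and accumulate its gated output;
            #    unselected experts contribute no computation at all.
            y = [sum(w * xi for w, xi in zip(row, x))
                 for row in self.experts[i]]
            out = [o + gate * yi for o, yi in zip(out, y)]
        return out, chosen, probs


def diversity_loss(batch_probs):
    """Penalize imbalanced average routing probabilities across a batch,
    a common load-balancing surrogate that pushes experts toward
    complementary specialization (the paper's exact diversity-enhancing
    objective is not given in the abstract)."""
    n = len(batch_probs[0])
    mean = [sum(p[i] for p in batch_probs) / len(batch_probs)
            for i in range(n)]
    uniform = 1.0 / n
    return sum((m - uniform) ** 2 for m in mean)
```

In this sketch, compute scales with `top_k` rather than with the total number of experts, which mirrors the abstract's goal of spending more capacity on complex scenes (via routing) without paying for all experts on every input.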