Although recent approaches to video instance segmentation have achieved promising results, they remain difficult to deploy in real-world applications on mobile devices, mainly because of (1) heavy computation and memory costs and (2) complicated heuristics for tracking objects. To address these issues, we present MobileInst, a lightweight, mobile-friendly framework for video instance segmentation. First, MobileInst adopts a mobile vision transformer to extract multi-level semantic features and introduces an efficient query-based dual-transformer instance decoder that produces mask kernels, together with a semantic-enhanced mask decoder, to generate instance segmentation for each frame. Second, MobileInst exploits simple yet effective kernel reuse and kernel association to track objects across frames. Further, we propose temporal query passing to enhance the tracking ability of the kernels. We conduct experiments on the COCO and YouTube-VIS datasets to demonstrate the superiority of MobileInst, and we evaluate inference latency on a mobile CPU core of a Qualcomm Snapdragon-778G without any other acceleration methods. On COCO, MobileInst achieves 30.5 mask AP at 176 ms on the mobile CPU, reducing latency by 50% compared to the previous SOTA. For video instance segmentation, MobileInst achieves 35.0 AP on YouTube-VIS 2019 and 30.1 AP on YouTube-VIS 2021. Code will be made available to facilitate real-world applications and future research.
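The tracking idea named in the abstract, kernel reuse and kernel association, can be illustrated with a toy sketch. The abstract does not say how mask kernels are turned into masks or how kernels are matched, so everything below is an assumption made for illustration: masks come from a per-pixel dot product between a kernel and the mask features (a 1x1 dynamic convolution), and association uses greedy cosine-similarity matching. The function names `apply_kernels`, `cosine`, and `associate_kernels` are hypothetical and not taken from MobileInst.

```python
import math


def apply_kernels(kernels, features):
    """Produce one binary mask per kernel (assumed 1x1 dynamic convolution).

    kernels:  list of N kernels, each a list of D floats
    features: list of P pixels (flattened H*W), each a list of D floats
    returns:  list of N masks, each a list of P values in {0, 1}
    """
    masks = []
    for kernel in kernels:
        logits = [sum(k * f for k, f in zip(kernel, pixel)) for pixel in features]
        # sigmoid(logit) > 0.5 is equivalent to logit > 0
        masks.append([1 if logit > 0 else 0 for logit in logits])
    return masks


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def associate_kernels(prev_kernels, new_kernels):
    """Greedy one-to-one matching (an assumption, not the paper's stated rule).

    returns: dict mapping each matched new-kernel index to a previous-kernel index
    """
    pairs = sorted(
        ((cosine(n, p), i, j)
         for i, n in enumerate(new_kernels)
         for j, p in enumerate(prev_kernels)),
        reverse=True,
    )
    match, used_prev = {}, set()
    for _sim, i, j in pairs:
        if i not in match and j not in used_prev:
            match[i] = j
            used_prev.add(j)
    return match


# Kernel reuse: the same kernels segment two consecutive frames,
# so instance k keeps index k without any explicit tracking step.
kernels = [[1.0, 0.0], [0.0, 1.0]]
frame_t = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
frame_t1 = [[-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
masks_t = apply_kernels(kernels, frame_t)
masks_t1 = apply_kernels(kernels, frame_t1)

# Kernel association: when fresh kernels are predicted later,
# link them back to the earlier identities by similarity.
new_kernels = [[0.1, 0.9], [0.8, 0.2]]
identity = associate_kernels(kernels, new_kernels)
```

In this sketch, reuse costs nothing beyond the per-frame mask computation, and association is only needed when kernels are re-predicted. Greedy matching is used here for brevity; an optimal assignment such as the Hungarian algorithm would be an equally plausible choice under the same assumptions.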