Efficient Image Captioning for Edge Devices

Recent years have witnessed rapid progress in image captioning. However, the large memory footprint and heavy computational burden of current captioning models prevent them from being deployed on mobile devices. The main obstacles are the heavyweight visual feature extractors (i.e., object detectors) and the complicated cross-modal fusion networks. To this end, we propose LightCap, a lightweight image captioner for resource-limited devices. The core design builds on the recent CLIP model for efficient image captioning. Specifically, on the one hand, we leverage the CLIP model to extract compact grid features without relying on time-consuming object detectors. On the other hand, we transfer the image-text retrieval design of CLIP to image captioning scenarios by devising a novel visual concept extractor and a cross-modal modulator. We further optimize the cross-modal fusion model and the parallel prediction heads via sequential and ensemble distillation. With this carefully designed architecture, our model contains only 40M parameters, reducing the model size by more than 75% and the FLOPs by more than 98% in comparison with the current state-of-the-art methods. Despite its low capacity, our model still exhibits state-of-the-art performance on prevalent datasets, e.g., 136.6 CIDEr on the COCO Karpathy test split. Tested on a smartphone with only a single CPU, the proposed LightCap achieves a fast inference speed of 188 ms per image, which makes it ready for practical applications.
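As a rough reading aid, the figures quoted above can be turned into implied bounds. The sketch below is back-of-the-envelope arithmetic only: the abstract does not name the baseline models, so the derived baseline size and FLOPs ratio are lower bounds implied by "more than 75%" and "more than 98%", and all variable and function names are illustrative assumptions.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the abstract.
# The baseline is not named there, so the results are implied lower bounds.
LIGHTCAP_PARAMS_M = 40   # "40M parameters"
SIZE_REDUCTION = 0.75    # "more than 75%" smaller, treated as a lower bound
FLOPS_REDUCTION = 0.98   # "more than 98%" fewer FLOPs, treated as a lower bound
LATENCY_MS = 188         # per image on a single smartphone CPU


def implied_baseline(value, reduction):
    """Return the baseline for which `value` is `reduction` (a fraction) smaller."""
    return value / (1.0 - reduction)


# A 75% reduction down to 40M implies a baseline of at least 160M parameters.
baseline_params_m = implied_baseline(LIGHTCAP_PARAMS_M, SIZE_REDUCTION)

# A 98% reduction implies the baseline needs at least 50x the FLOPs.
flops_ratio = implied_baseline(1.0, FLOPS_REDUCTION)

# 188 ms per image corresponds to a bit over five images per second.
images_per_second = 1000.0 / LATENCY_MS

print(f"implied baseline size: >= {baseline_params_m:.0f}M parameters")
print(f"implied baseline FLOPs: >= {flops_ratio:.0f}x LightCap")
print(f"throughput: about {images_per_second:.1f} images/s")
```

Under these assumptions, the compared methods would have at least 160M parameters and at least 50 times the FLOPs, and the reported latency corresponds to roughly five captions per second on a single CPU.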
