Pretraining is a dominant paradigm in computer vision. Supervised ImageNet pretraining is commonly used to initialize the backbones of person re-identification (Re-ID) models. However, recent works show the surprising result that CNN pretraining on ImageNet has limited impact on Re-ID systems, owing to the large domain gap between ImageNet and person Re-ID data. Seeking an alternative to traditional pretraining, we investigate semantic-based pretraining, which exploits additional textual data in place of ImageNet images. Specifically, we manually construct FineGPR-C, the first diversified caption dataset for person Re-ID events. Building on it, we propose VTBR, a pure semantic-based pretraining approach that uses dense captions to learn visual representations from fewer images. We train convolutional neural networks from scratch on the captions of the FineGPR-C dataset and then transfer them to downstream Re-ID tasks. Comprehensive experiments on benchmark datasets show that VTBR achieves competitive performance compared with ImageNet pretraining, despite using up to 1.4x fewer images, revealing its potential for Re-ID pretraining.
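The pipeline the abstract describes, pretrain a visual encoder with caption supervision and then transfer it to a downstream task, can be caricatured with a toy linear model. This is a minimal sketch in plain NumPy under stated assumptions: the random "images", the bag-of-words caption targets, and the linear encoder are all illustrative stand-ins, not the paper's actual VTBR architecture or FineGPR-C data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not the paper's data or model):
# "images" are random feature vectors; "captions" are bag-of-words labels.
n, d_img, vocab = 200, 32, 50
images = rng.normal(size=(n, d_img))
true_W = rng.normal(size=(d_img, vocab))
captions = (images @ true_W > 1.0).astype(float)

def bce(probs, targets):
    # Mean binary cross-entropy over all (sample, word) pairs.
    eps = 1e-9
    return -np.mean(targets * np.log(probs + eps)
                    + (1 - targets) * np.log(1 - probs + eps))

# "Semantic pretraining": fit a linear visual encoder to predict caption
# words, i.e. per-word logistic regression trained by gradient descent.
W = np.zeros((d_img, vocab))
lr = 0.5
initial_loss = bce(1 / (1 + np.exp(-(images @ W))), captions)
for _ in range(500):
    probs = 1 / (1 + np.exp(-(images @ W)))
    W -= lr * images.T @ (probs - captions) / n
final_loss = bce(1 / (1 + np.exp(-(images @ W))), captions)
print(f"caption loss: {initial_loss:.3f} -> {final_loss:.3f}")

# "Transfer": reuse the caption-supervised weights as features for a
# downstream task, mimicking backbone transfer to Re-ID fine-tuning.
features = images @ W
```

The point of the sketch is only the two-stage structure: the encoder weights are shaped entirely by textual (caption) supervision, and the downstream task consumes those representations rather than training from scratch.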