Pretraining is a dominant paradigm in computer vision. Supervised ImageNet pretraining is commonly used to initialize the backbones of person re-identification (Re-ID) models. However, recent work shows, surprisingly, that ImageNet pretraining has limited impact on Re-ID systems because of the large domain gap between ImageNet and person Re-ID data. To seek an alternative to traditional pretraining, we manually construct, for the first time, a diversified caption dataset, FineGPR-C, for person Re-ID events. Building on it, we propose VTBR, a purely semantic-based pretraining approach that uses dense captions to learn visual representations from fewer images. Specifically, we train convolutional networks from scratch on the captions of the FineGPR-C dataset and transfer them to downstream Re-ID tasks. Comprehensive experiments on benchmarks show that VTBR achieves competitive performance compared with ImageNet pretraining despite using up to 1.4x fewer images, revealing its potential for Re-ID pretraining.