This paper introduces a novel approach that leverages features learned under both supervised and self-supervised paradigms to improve image classification, specifically vehicle classification. Two state-of-the-art self-supervised learning methods, DINO and data2vec, were evaluated and compared for their representation learning of vehicle images: the former contrasts local and global views, while the latter performs masked prediction over multi-layered representations. For the supervised paradigm, a pretrained YOLOR object detector was finetuned to detect vehicle wheels, from which definitive wheel positional features were retrieved. The representations learned by these self-supervised methods were then combined with the wheel positional features for the vehicle classification task. In particular, a random wheel masking strategy was employed to finetune the previously learned representations in harmony with the wheel positional features during classifier training. Our experiments show that the data2vec-distilled representations, which are consistent with our wheel masking strategy, outperformed their DINO counterparts, achieving a Top-1 classification accuracy of 97.2% on the 13 vehicle classes defined by the Federal Highway Administration.
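To make the feature-fusion and random wheel masking ideas concrete, the following is a minimal PyTorch sketch, not the paper's implementation. The class name `FusionClassifier`, the embedding dimension, the maximum number of wheels, the masking probability, and the use of simple concatenation followed by an MLP head are all illustrative assumptions; only the overall scheme (self-supervised image embedding + wheel positional features, with wheels randomly masked during training) comes from the abstract.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Hypothetical fusion head: concatenates a pooled self-supervised image
    embedding (e.g., from DINO or data2vec) with wheel positional features
    obtained from a wheel detector, then classifies into 13 FHWA classes."""

    def __init__(self, embed_dim=768, max_wheels=6, num_classes=13, p_mask=0.3):
        super().__init__()
        self.p_mask = p_mask                 # assumed per-wheel masking probability
        wheel_dim = max_wheels * 2           # assumed (x, y) centre per wheel slot
        self.head = nn.Sequential(
            nn.Linear(embed_dim + wheel_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, img_embedding, wheel_xy):
        # img_embedding: (B, embed_dim) pooled backbone representation
        # wheel_xy:      (B, max_wheels, 2) normalized wheel centres,
        #                zero-padded when fewer wheels are detected
        if self.training:
            # Random wheel masking: drop each wheel's coordinates with prob p_mask,
            # so the classifier cannot over-rely on any single wheel position.
            keep = torch.rand(wheel_xy.shape[:2], device=wheel_xy.device) > self.p_mask
            wheel_xy = wheel_xy * keep.unsqueeze(-1)
        fused = torch.cat([img_embedding, wheel_xy.flatten(1)], dim=1)
        return self.head(fused)

# Usage with dummy tensors (batch of 4 images, up to 6 wheels each)
model = FusionClassifier()
logits = model(torch.randn(4, 768), torch.rand(4, 6, 2))  # -> shape (4, 13)
```

Zeroing out masked wheel coordinates is one plausible reading of "random wheel masking"; the paper may instead mask wheels at the image or token level before the backbone.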