In this paper, we present our solution, which placed 5th in the Kaggle Google Universal Image Embedding competition in 2022. We use the ViT-H visual encoder of CLIP from the OpenCLIP repository as a backbone and train a head model composed of BatchNormalization and Linear layers using ArcFace. The dataset used was a subset of Products-10K, GLDv2, GPR1200, and Food101. Applying test-time augmentation (TTA) to part of the images further improves the score. With this method, we achieve a score of 0.684 on the public and 0.688 on the private leaderboard. Our code is available at https://github.com/riron1206/kaggle-Google-Universal-Image-Embedding-Competition-5th-Place-Solution
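The head model and loss described above can be sketched as follows. This is a minimal illustration, not the authors' exact code: the input dimension of 1024 (CLIP ViT-H image features), the 64-d output (the competition's required embedding size), and the ArcFace scale `s=30.0` and margin `m=0.3` are assumed values for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmbeddingHead(nn.Module):
    """BatchNorm + Linear head projecting frozen CLIP features to 64-d."""

    def __init__(self, in_dim: int = 1024, out_dim: int = 64):
        super().__init__()
        self.bn = nn.BatchNorm1d(in_dim)
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.bn(x))


class ArcFaceLoss(nn.Module):
    """ArcFace: cross-entropy over cosine logits with an additive
    angular margin applied to the ground-truth class (Deng et al., 2019).
    Hyperparameters s and m here are illustrative assumptions."""

    def __init__(self, emb_dim: int, n_classes: int,
                 s: float = 30.0, m: float = 0.3):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(n_classes, emb_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized embeddings and class centers.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        # Add the angular margin m only to the target-class angle.
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(self.s * logits, labels)
```

In training, the CLIP backbone would stay frozen (or lightly fine-tuned) while the head and the ArcFace class-center weights are optimized on the labeled retrieval datasets.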