This paper presents the 6th place solution to the Google Universal Image Embedding competition on Kaggle. Our approach is based on CLIP, a powerful pre-trained model that learns visual representations from natural-language supervision. We also employed the Sub-center ArcFace loss with dynamic margins to improve class separability and the discriminative power of the embeddings. Finally, we built a diverse training dataset based on the test set's categories and leaderboard feedback. By carefully crafting a training scheme that enhances transfer learning, our submission scored 0.685 on the private leaderboard.
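As a point of reference for the loss mentioned above, the following is a minimal sketch of a Sub-center ArcFace head with per-class (dynamic) margins in PyTorch. The class name, the number of sub-centers `k`, the scale `s`, and the default margin value are illustrative assumptions, not the authors' actual settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCenterArcFace(nn.Module):
    """Sub-center ArcFace head with per-class (dynamic) margins.

    Hypothetical sketch: k sub-centers per class; each class's logit is the
    max cosine similarity over its sub-centers, and a class-specific angular
    margin is added to the target logit before scaling.
    """

    def __init__(self, embed_dim, num_classes, k=3, s=30.0, margins=None):
        super().__init__()
        self.k = k
        self.s = s
        self.num_classes = num_classes
        # One weight vector per sub-center per class.
        self.weight = nn.Parameter(torch.randn(num_classes * k, embed_dim))
        # Dynamic margins: one margin per class (e.g. larger for rare classes).
        if margins is None:
            margins = torch.full((num_classes,), 0.3)
        self.register_buffer("margins", margins)

    def forward(self, embeddings, labels):
        # Cosine similarity between normalized embeddings and sub-center weights.
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # Reduce over the k sub-centers of each class by taking the max.
        cos = cos.view(-1, self.num_classes, self.k).max(dim=2).values
        # Add each sample's class-specific angular margin to its target logit.
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        m = self.margins[labels]
        target = F.one_hot(labels, self.num_classes).bool()
        logits = torch.where(target, torch.cos(theta + m.unsqueeze(1)), cos)
        return self.s * logits  # feed into cross-entropy
```

The scaled logits are passed to a standard cross-entropy loss; the margin pushes each sample's embedding further from its target class boundary, tightening intra-class clusters.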