Sports analytics benefits from recent advances in machine learning that provide a competitive advantage for teams or individuals. One important task in this context is measuring the performance of individual players to provide reports and logs for subsequent analysis. During sports events such as basketball matches, this involves re-identifying players either across multiple camera viewpoints or from a single camera viewpoint at different times. In this work, we investigate whether the outstanding zero-shot performance of pre-trained CLIP models can be transferred to the domain of player re-identification. For this purpose, we reformulate CLIP's contrastive language-to-image pre-training into a contrastive image-to-image training approach with the InfoNCE loss as the training objective. Unlike previous work, our approach is entirely class-agnostic and benefits from large-scale pre-training. With a fine-tuned CLIP ViT-L/14 model we achieve 98.44 % mAP on the MMSports 2022 Player Re-Identification challenge. Furthermore, we show that CLIP Vision Transformers already possess strong OCR capabilities, identifying useful player features such as shirt numbers in a zero-shot manner without any fine-tuning on the dataset. By applying the Score-CAM algorithm, we visualise the image regions that our fine-tuned model considers most important when computing the similarity score between two images of a player.
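The image-to-image InfoNCE objective mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the temperature value, and the use of NumPy instead of a deep-learning framework are all assumptions made for clarity. Row i of each embedding matrix is assumed to come from the same player instance (the positive pair); all other rows in the batch act as negatives.

```python
import numpy as np

def info_nce_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss for paired image embeddings (illustrative sketch).

    emb_a, emb_b: (N, D) arrays; row i of emb_a and row i of emb_b embed
    the same player instance, all other rows serve as in-batch negatives.
    """
    # L2-normalise so the dot product equals cosine similarity
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (N, N) scaled similarity matrix

    def cross_entropy_diagonal(l):
        # Cross-entropy where the matching (diagonal) entry is the target class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # Average the loss over both matching directions (a->b and b->a)
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(logits.T))
```

With correctly paired embeddings the loss approaches zero, while mismatched pairs yield a large loss, which is the signal that pulls embeddings of the same player together and pushes different players apart.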