In this work we propose a multi-modal architecture for analyzing soccer scenes from tactical camera footage, with a focus on three core tasks: ball trajectory inference, ball state classification, and ball possessor identification. To this end, our solution integrates three distinct input modalities (player trajectories, player types, and image crops of individual players) into a unified framework that processes spatial and temporal dynamics using a cascade of sociotemporal transformer blocks. Unlike prior methods, which rely heavily on accurate ball tracking or handcrafted heuristics, our approach infers the ball trajectory without direct access to its past or future positions, and robustly identifies the ball state and ball possessor under the noisy and occluded conditions of real top-league matches. We also introduce CropDrop, a modality-specific masking strategy that prevents over-reliance on image features and encourages the model to exploit cross-modal patterns during pre-training. We demonstrate the effectiveness of our approach on a large-scale dataset, achieving substantial improvements over state-of-the-art baselines on all tasks. Our results highlight the benefits of combining structured and visual cues in a transformer-based architecture, and the importance of realistic masking strategies in multi-modal learning.
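The core idea behind CropDrop can be illustrated with a minimal sketch: during pre-training, each player's image-crop embedding is randomly replaced with a shared mask token, so the model cannot lean exclusively on visual features and must fall back on trajectory and player-type cues. The function name, argument shapes, and the use of a learned mask token are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cropdrop_mask(crop_emb, mask_token, p_drop=0.5, rng=None):
    """Hypothetical CropDrop sketch (not the paper's code).

    With probability p_drop, replace a player's image-crop embedding
    with a shared mask token, forcing reliance on cross-modal cues.

    crop_emb:   (num_players, dim) array of per-player crop embeddings
    mask_token: (dim,) placeholder vector (learned in a real model)
    Returns the masked embeddings and the boolean keep-mask.
    """
    rng = rng or np.random.default_rng()
    # Per-player Bernoulli decision: True = keep the crop embedding.
    keep = rng.random(crop_emb.shape[0]) >= p_drop
    # Broadcast the keep-mask over the feature dimension.
    out = np.where(keep[:, None], crop_emb, mask_token)
    return out, keep
```

In a full training loop, this replacement would be applied per frame before the sociotemporal transformer blocks, with `p_drop` acting as the masking rate hyperparameter.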