This work aims at generating captions for soccer videos using deep learning. In this context, this paper introduces a dataset, a model, and a three-level evaluation. The dataset consists of 22k caption-clip pairs and three visual features (images, optical flow, inpainting) for ~500 hours of \emph{SoccerNet} videos. The model is divided into three parts: a transformer learns language, ConvNets learn vision, and a fusion of linguistic and visual features generates captions. The paper proposes evaluating generated captions at three levels: syntax (commonly used evaluation metrics such as BLEU and CIDEr), meaning (the quality of descriptions as judged by a domain expert), and corpus (the diversity of generated captions). The paper shows that semantics-related losses that prioritize selected words improve the diversity of generated captions (from 0.07 to 0.18). Together, the semantics-related losses and the use of additional visual features (optical flow, inpainting) improve the normalized captioning score by 28\%. The web page of this work: https://sites.google.com/view/soccercaptioning
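As a concrete illustration of the three-part design (ConvNet vision, transformer language, fusion for caption generation), the following is a minimal PyTorch-style sketch. The layer sizes, feature dimensions, and module choices are assumptions for exposition only, not the authors' configuration.

\begin{verbatim}
import torch
import torch.nn as nn

class SoccerCaptioner(nn.Module):
    """Sketch of the three-part model: ConvNet vision, transformer
    language, and a fusion step that decodes captions.
    All sizes below are illustrative assumptions."""

    def __init__(self, vocab_size, d_model=512, cnn_dim=2048):
        super().__init__()
        # Vision: per-stream ConvNet features (images, optical flow,
        # inpainting), assumed precomputed and concatenated per frame.
        self.visual_proj = nn.Linear(3 * cnn_dim, d_model)
        # Language: caption tokens embedded and decoded by a transformer
        # that cross-attends to the fused visual memory.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, tokens):
        # visual_feats: (batch, frames, 3*cnn_dim); tokens: (batch, length)
        memory = self.visual_proj(visual_feats)  # fuse the three streams
        tgt = self.embed(tokens)
        # Causal mask: each position attends only to earlier tokens.
        L = tokens.size(1)
        mask = torch.triu(torch.full((L, L), float("-inf"),
                                     device=tokens.device), diagonal=1)
        h = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(h)  # next-token logits over the vocabulary
\end{verbatim}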
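The corpus-level diversity score quoted above (0.07 to 0.18) is defined in the full paper. As a rough illustration of this class of metric only, the distinct-n ratio below counts unique n-grams against total n-grams over all generated captions; it is a common diversity proxy, not the paper's exact measure.

\begin{verbatim}
from collections import Counter

def distinct_n(captions, n=1):
    """Corpus-level diversity proxy: unique n-grams / total n-grams.
    Illustrative stand-in; the paper defines its own diversity score."""
    ngrams, total = Counter(), 0
    for cap in captions:
        toks = cap.split()
        for i in range(len(toks) - n + 1):
            ngrams[tuple(toks[i:i + n])] += 1
            total += 1
    return len(ngrams) / max(total, 1)

# Low value: many repeated captions; higher value: more varied output.
print(distinct_n(["goal by player", "goal by player",
                  "yellow card shown"]))
\end{verbatim}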