Camera movement classification (CMC) models trained on contemporary, high-quality footage often degrade when applied to archival film, where noise, missing frames, and low contrast obscure motion cues. We bridge this gap by assembling a unified benchmark that consolidates two modern corpora into four canonical classes and restructures the HISTORIAN collection into five balanced categories. Building on this benchmark, we introduce DGME-T, a lightweight extension to the Video Swin Transformer that injects directional grid motion encoding, derived from optical flow, via a learnable and normalised late-fusion layer. On modern clips, DGME-T raises the backbone's top-1 accuracy from 81.78% to 86.14% and its macro F1 from 82.08% to 87.81%; on the more demanding World-War-II footage it still improves accuracy from 83.43% to 84.62% and macro F1 from 81.72% to 82.63%. A cross-domain study further shows that an intermediate fine-tuning stage on modern data increases historical performance by more than five percentage points. These results demonstrate that structured motion priors and transformer representations are complementary, and that even a small, carefully calibrated motion head can substantially enhance robustness in degraded-film analysis. Related resources are available at https://github.com/linty5/DGME-T.
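To make the directional grid motion encoding concrete, the following is a minimal sketch of one plausible form of such a descriptor: a magnitude-weighted histogram of optical-flow directions pooled over a coarse spatial grid. The function name, grid size, and bin count are illustrative assumptions, not the paper's exact formulation; the authoritative implementation lives in the linked repository.

```python
import numpy as np

def dgme_descriptor(flow, grid=(4, 4), bins=8):
    """Directional grid motion encoding from a dense optical-flow field.

    flow: (H, W, 2) array of per-pixel (dx, dy) displacements.
    Returns a flat (grid_h * grid_w * bins) histogram of flow directions,
    magnitude-weighted and L1-normalised per cell. All sizes are
    hypothetical defaults chosen for illustration.
    """
    h, w, _ = flow.shape
    gh, gw = grid
    angles = np.arctan2(flow[..., 1], flow[..., 0])          # in [-pi, pi]
    mags = np.linalg.norm(flow, axis=-1)
    # Quantise each pixel's flow direction into one of `bins` sectors.
    bin_idx = ((angles + np.pi) / (2 * np.pi) * bins).astype(int) % bins
    desc = np.zeros((gh, gw, bins), dtype=np.float32)
    for i in range(gh):
        for j in range(gw):
            ys = slice(i * h // gh, (i + 1) * h // gh)
            xs = slice(j * w // gw, (j + 1) * w // gw)
            cell_bins = bin_idx[ys, xs].ravel()
            cell_mags = mags[ys, xs].ravel()
            # Accumulate magnitude-weighted votes per direction bin.
            np.add.at(desc[i, j], cell_bins, cell_mags)
            total = desc[i, j].sum()
            if total > 0:
                desc[i, j] /= total                          # per-cell L1 norm
    return desc.ravel()
```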
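And a minimal PyTorch sketch of how the learnable, normalised late-fusion layer described above might inject such a descriptor into pooled Video Swin features before classification. The class name, feature dimensions, and the scalar-gated residual fusion form are assumptions for illustration, not the paper's verified architecture.

```python
import torch
import torch.nn as nn

class DGMELateFusion(nn.Module):
    """Hypothetical fusion head combining backbone clip features with a
    directional grid motion encoding (DGME) vector."""

    def __init__(self, backbone_dim=768, grid_cells=16, direction_bins=8,
                 num_classes=4):
        super().__init__()
        dgme_dim = grid_cells * direction_bins
        # Small projection head for the motion encoding.
        self.motion_head = nn.Sequential(
            nn.LayerNorm(dgme_dim),            # normalise motion statistics
            nn.Linear(dgme_dim, backbone_dim),
            nn.GELU(),
        )
        # Learnable scalar controlling how much motion evidence is injected.
        self.alpha = nn.Parameter(torch.tensor(0.1))
        self.norm = nn.LayerNorm(backbone_dim)
        self.classifier = nn.Linear(backbone_dim, num_classes)

    def forward(self, backbone_feat, dgme_vec):
        # backbone_feat: (B, backbone_dim) pooled Video Swin features
        # dgme_vec:      (B, grid_cells * direction_bins) flow histogram
        fused = self.norm(backbone_feat + self.alpha * self.motion_head(dgme_vec))
        return self.classifier(fused)
```

In this sketch, initialising the gate `alpha` near zero lets the pretrained backbone dominate early in fine-tuning, with the motion prior phased in as training calibrates its weight, which is one plausible reading of "carefully calibrated motion head" in the abstract.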