Physical processes, camera movement, and unpredictable environmental conditions such as the presence of dust can induce noise and artifacts in video feeds. We observe that popular unsupervised multi-object tracking (MOT) methods depend on noise-free inputs. We show that adding a small amount of artificial random noise causes a sharp degradation in model performance on benchmark metrics. We address this problem by introducing a robust unsupervised MOT model: AttU-Net. The proposed single-head attention model helps limit the negative impact of noise by learning visual representations at different segment scales. AttU-Net achieves better unsupervised MOT performance than variational inference-based state-of-the-art baselines. We evaluate our method on the MNIST-MOT and Atari game video benchmarks. We also provide two extended video datasets to validate the effectiveness of MOT models: ``Kuzushiji-MNIST MOT'', which consists of moving Japanese characters, and ``Fashion-MNIST MOT''.