While there have been significant gains in the field of automated video description, the generalization of automated description models to novel domains remains a major barrier to deploying these systems in the real world. Most visual description methods are known to capture and exploit patterns in the training data that lead to evaluation-metric increases, but what are those patterns? In this work, we examine several popular visual description datasets, and we capture, analyze, and understand the dataset-specific linguistic patterns that models exploit but that do not generalize to new domains. At the token, sample, and dataset levels, we find that limited caption diversity is a major driving factor behind the generation of generic and uninformative captions. We further show that state-of-the-art models even outperform held-out ground-truth captions on modern metrics, and that this effect is an artifact of limited linguistic diversity in datasets. Because understanding this linguistic diversity is key to building strong captioning models, we recommend several methods and approaches for maintaining diversity in the collection of new data, and for dealing with the consequences of limited diversity when using current models and metrics.