Video Question Answering (VidQA) evaluation metrics have been limited to single-word answers or to selecting a phrase from a fixed set of phrases. Such metrics restrict the application scenarios of VidQA models. In this work, we leverage semantic roles derived from video descriptions to mask out certain phrases and introduce VidQAP, which poses VidQA as a fill-in-the-phrase task. To enable evaluation of answer phrases, we compute the relative improvement of the predicted answer over an empty string. To reduce the influence of language bias in VidQA datasets, we retrieve a video having a different answer for the same question. To facilitate research, we construct ActivityNet-SRL-QA and Charades-SRL-QA and benchmark them by extending three vision-language models. We further perform extensive analysis and ablative studies to guide future work.
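To make the empty-string comparison concrete, a minimal sketch follows. The placeholder token-overlap F1 metric and the simple score difference used for "relative improvement" are illustrative assumptions only, not the paper's exact scoring function.

```python
# Minimal sketch of scoring an answer phrase by its improvement over an
# empty-string baseline. The token-F1 metric and the plain difference are
# assumptions for illustration, not the paper's exact formulation.

def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 between two phrases (placeholder metric)."""
    p, r = pred.split(), ref.split()
    if not p or not r:
        return 0.0
    overlap = len(set(p) & set(r))
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(r)
    return 2 * prec * rec / (prec + rec)

def relative_improvement(pred: str, ref: str, metric=token_f1) -> float:
    """Score a predicted phrase relative to the empty-string baseline."""
    baseline = metric("", ref)          # score an empty answer would receive
    return metric(pred, ref) - baseline  # improvement over the empty string

if __name__ == "__main__":
    # Example: a partially correct phrase still earns positive credit.
    print(relative_improvement("riding a horse", "riding a brown horse"))
```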