Video question answering has recently received considerable attention from multimodal video researchers. Most video question answering datasets take the form of multiple-choice questions. However, a model trained for the multiple-choice task does not actually infer the answer; instead, it compares the answer candidates and picks the most likely one. This also makes the formulation difficult to extend to other tasks. In this paper, we challenge the existing multiple-choice video question answering setup by converting it to open-ended video question answering. To tackle open-ended question answering, we use a pretrained GPT-2 model, which is fine-tuned with video inputs and subtitles. An ablation study performed on the DramaQA dataset, converted to an open-ended question answering form, shows that performance can be improved by using video metadata.
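The abstract does not spell out how video inputs and subtitles are fed to GPT-2, so the following is only a minimal sketch, assuming precomputed per-frame video features and the HuggingFace transformers GPT-2 implementation. The VideoQAGPT2 wrapper, the video_proj projection, and the 2048-dimensional frame features are illustrative assumptions, not the paper's specification; the sketch projects video features into the GPT-2 embedding space, prepends them to the tokenized subtitle, question, and answer text, and trains with the standard language-modeling loss so the answer is generated in an open-ended way.

```python
# Minimal sketch (not the authors' exact architecture) of fine-tuning GPT-2
# for open-ended video QA: video features are projected into the GPT-2
# embedding space and prepended to the subtitle + question + answer tokens.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")


class VideoQAGPT2(nn.Module):
    """Hypothetical wrapper: maps precomputed video features into GPT-2 inputs."""

    def __init__(self, gpt2, video_feat_dim=2048):
        super().__init__()
        self.gpt2 = gpt2
        # Linear projection from the video feature space to the GPT-2 hidden size.
        self.video_proj = nn.Linear(video_feat_dim, gpt2.config.n_embd)

    def forward(self, video_feats, text_ids, labels=None):
        # video_feats: (batch, num_frames, video_feat_dim)
        # text_ids:    (batch, seq_len) tokenized "subtitle + question + answer"
        video_embeds = self.video_proj(video_feats)            # (B, F, n_embd)
        text_embeds = self.gpt2.transformer.wte(text_ids)      # (B, T, n_embd)
        inputs_embeds = torch.cat([video_embeds, text_embeds], dim=1)
        if labels is not None:
            # Ignore the LM loss on the video-feature positions (-100 = ignored).
            pad = torch.full(video_feats.shape[:2], -100,
                             dtype=labels.dtype, device=labels.device)
            labels = torch.cat([pad, labels], dim=1)
        return self.gpt2(inputs_embeds=inputs_embeds, labels=labels)


# Usage sketch: one training step on a single (video, subtitle, question, answer) example.
model = VideoQAGPT2(gpt2)
video_feats = torch.randn(1, 8, 2048)  # e.g. 8 frame features from a CNN backbone
text = ("Subtitle: Dokyung looks at Haeyoung. "
        "Question: Who does Dokyung look at? Answer: Haeyoung.")
ids = tokenizer(text, return_tensors="pt").input_ids
loss = model(video_feats, ids, labels=ids).loss
loss.backward()
```

For simplicity the loss here is computed over the whole text sequence; in practice one would typically mask the subtitle and question tokens as well, so that only the answer tokens contribute to the fine-tuning objective.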