We discuss problems with the standard approaches to evaluation for tasks like visual question answering, and argue that artificial data can be used to address these as a complement to current practice. We demonstrate that, with the help of existing 'deep' linguistic processing technology, we are able to create challenging abstract datasets that enable us to investigate the language understanding abilities of multimodal deep learning models in detail, rather than reducing evaluation to a single performance value on a static and monolithic dataset.