Vision-Language Pretraining (VLP) models have recently facilitated many cross-modal downstream tasks. Most existing works evaluate their systems by comparing fine-tuned downstream task performance. However, average downstream task accuracy alone provides little information about the pros and cons of each VLP method, let alone insight into how the community can improve these systems in the future. Inspired by CheckList for testing natural language processing, we introduce VL-CheckList, a novel framework for understanding the capabilities of VLP models. The proposed method divides the image-text matching ability of a VLP model into three categories: objects, attributes, and relations, and uses a novel taxonomy to further break down each of these aspects. We conduct comprehensive studies of seven popular VLP models using the proposed framework. The results confirm the effectiveness of the proposed method by revealing fine-grained differences among the compared models that are not visible from downstream-task-only evaluation. Further results suggest promising research directions for building better VLP models. Data and Code: https://github.com/om-ai-lab/VL-CheckList
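To make the evaluation idea concrete, below is a minimal sketch (not the official VL-CheckList code): it groups probing cases under the object/attribute/relation taxonomy and reports, per aspect, how often a model's image-text matching score prefers the true caption over a perturbed one. All names here (ProbeCase, evaluate, the sub-aspects, the score callable) are illustrative assumptions, not the repository's actual API.

```python
# Sketch: organizing probing cases by an Object / Attribute / Relation taxonomy
# and scoring a VLP model by whether it prefers the true caption over a
# perturbed (negative) caption for the same image.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ProbeCase:
    image_id: str   # identifier of the image under test
    positive: str   # caption that matches the image
    negative: str   # caption with one aspect perturbed (e.g., a swapped attribute)

# Hypothetical sub-aspects; the paper's taxonomy refines each category further.
TAXONOMY: Dict[str, List[str]] = {
    "Object":    ["location", "size"],
    "Attribute": ["color", "material", "state"],
    "Relation":  ["spatial", "action"],
}

def evaluate(score: Callable[[str, str], float],
             cases_by_aspect: Dict[str, List[ProbeCase]]) -> Dict[str, float]:
    """Per-aspect accuracy: fraction of cases where the model scores the
    positive caption higher than the perturbed negative caption."""
    results = {}
    for aspect, cases in cases_by_aspect.items():
        correct = sum(score(c.image_id, c.positive) > score(c.image_id, c.negative)
                      for c in cases)
        results[aspect] = correct / max(len(cases), 1)
    return results

if __name__ == "__main__":
    # Dummy scorer standing in for a real VLP image-text matching head.
    dummy = lambda img, txt: float(len(txt))
    cases = {"Attribute:color": [ProbeCase("img0", "a red car", "a blue car")]}
    print(evaluate(dummy, cases))
```

In this setup, the per-aspect accuracies (rather than a single averaged number) are what expose where a given VLP model is strong or weak.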