The Visual Question Answering (VQA) task combines visual and language analysis to answer a textual question about an image. It has been a popular research topic with an increasing number of real-world applications over the last decade. This paper describes our recent research on AliceMind-MMU (ALIbaba's Collection of Encoder-decoders from Machine IntelligeNce lab of Damo academy - MultiMedia Understanding), which obtains results similar to, or even slightly better than, human performance on VQA. This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representations; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task. Treating different types of visual questions with the corresponding expertise plays an important role in boosting the performance of our VQA architecture up to the human level. An extensive set of experiments and analyses demonstrates the effectiveness of the new research work.
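To make the pipeline concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of the two ideas named in the abstract: cross-modal attention between question tokens and image-region features, followed by routing each question to a type-specific expert head. All module names, the expert partition (e.g. yes/no, counting, open-ended), and the dimensions are illustrative assumptions.

    # Hypothetical sketch of cross-modal attention plus expert routing for VQA.
    import torch
    import torch.nn as nn

    class CrossModalExpertVQA(nn.Module):
        def __init__(self, dim=768, num_heads=12, num_answers=3129, num_experts=3):
            super().__init__()
            # Learned cross-modal interaction: question tokens attend to image regions.
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # One lightweight classifier ("expert") per coarse question type
            # (an assumed partition such as yes/no, counting, open-ended).
            self.experts = nn.ModuleList(
                [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_answers))
                 for _ in range(num_experts)]
            )
            # A simple router that predicts the question type from the fused pooled state.
            self.router = nn.Linear(dim, num_experts)

        def forward(self, text_feats, image_feats):
            # text_feats: (B, T, dim) token embeddings; image_feats: (B, R, dim) region features.
            fused, _ = self.cross_attn(text_feats, image_feats, image_feats)
            cls = fused[:, 0]                        # pooled question-conditioned state
            gate = self.router(cls).softmax(dim=-1)  # soft assignment over experts
            logits = torch.stack([e(cls) for e in self.experts], dim=1)  # (B, E, A)
            return (gate.unsqueeze(-1) * logits).sum(dim=1)              # weighted answer scores

    # Usage with random tensors standing in for real encoder outputs:
    model = CrossModalExpertVQA()
    text = torch.randn(2, 20, 768)    # 2 questions, 20 tokens each
    image = torch.randn(2, 36, 768)   # 36 detected regions per image
    scores = model(text, image)       # (2, 3129) answer logits

In practice, the gating could also be a hard assignment from a question-type classifier; the soft weighting shown here is just one assumed design choice.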