目标导向的视觉对话是“视觉-语言”交叉领域中一个较新的任务,它要求机器能通过多轮对话完成视觉相关的特定目标。该任务兼具研究意义与应用价值。日前,北京邮电大学王小捷教授团队与美团AI平台NLP中心团队合作,在目标导向的视觉对话任务上的研究论文《Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue-commentCZ》被国际多媒体领域顶级会议ACM MM2020录用。该论文分享了在目标导向视觉对话中的最新进展,即提出了一种响应驱动的视觉状态估计器(Answer-Driven Visual State Estimator,ADVSE)用于融合视觉对话中的对话历史信息和图片信息,其中的聚焦注意力机制(Answer-Driven Focusing Attention,ADFA)能有效强化响应信息,条件视觉信息融合机制(Conditional Visual Information Fusion,CVIF)用于自适应选择全局和差异信息。该估计器不仅可以用于生成问题,还可以用于回答问题。在视觉对话的国际公开数据集GuessWhat?!上的实验结果表明,该模型在问题生成和回答上都取得了当前的领先水平。
[1] FlorianStrub,HarmdeVries,JérémieMary,BilalPiot,AaronC.Courville,and Olivier Pietquin. 2017. End-to-end optimization of goal-driven and visually grounded dialogue systems. In Joint Conference on Artificial Intelligence.[2] Pushkar Shukla, Carlos Elmadjian, Richika Sharan, Vivek Kulkarni, Matthew Turk, and William Yang Wang. 2019. What Should I Ask? Using Conversationally Informative Rewards for Goal-oriented Visual Dialog.. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for ComputationalLinguistics,Florence,Italy,6442–6451. https://doi.org/10.18653/v1/P19-1646[3] JunjieZhang,QiWu,ChunhuaShen, JianZhang, JianfengLu, andAntonvanden Hengel. 2018. Goal-Oriented Visual Question Generation via Intermediate Re- wards. In Proceedings of the European Conference on Computer Vision.[4] Ehsan Abbasnejad, Qi Wu, Iman Abbasnejad, Javen Shi, and Anton van den Hengel. 2018. An Active Information Seeking Model for Goal-oriented Vision- and-Language Tasks. CoRR abs/1812.06398 (2018). arXiv:1812.06398 http://arxiv.org/abs/1812.06398.[5] EhsanAbbasnejad, QiWu,JavenShi, andAntonvandenHengel. 2018. What’sto Know? Uncertainty as a Guide to Asking Goal-Oriented Questions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4150–4159.[6] Chaorui Deng, Qi Wu, Qingyao Wu, Fuyuan Hu, Fan Lyu, and Mingkui Tan. 2018. Visual Grounding via Accumulated Attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7746–7755.[7] Tianhao Yang, Zheng-Jun Zha, and Hanwang Zhang. 2019. Making History Matter: History-Advantage Sequence Training for Visual Dialog. In Proceedings of the IEEE International Conference on Computer Vision. 2561–2569.
[8] BohanZhuang, QiWu, ChunhuaShen,IanD. Reid,andAntonvandenHengel. 2018. Parallel Attention: A Unified Framework for Visual Object Discovery Through Dialogs and Queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4252–4261.