SeqDialN: Sequential Visual Dialog Networks in Joint Visual-Linguistic Representation Space

In this work, we formulate a visual dialog as an information flow in which each piece of information is encoded with the joint visual-linguistic representation of a single dialog round. Based on this formulation, we treat the visual dialog task as a sequence problem over ordered visual-linguistic vectors. For featurization, we use a Dense Symmetric Co-Attention network as a lightweight vision-language joint representation generator to fuse multimodal features (i.e., image and text), which improves both computational and data efficiency. For inference, we propose two Sequential Dialog Networks (SeqDialN): the first uses an LSTM for information propagation (IP) and the second uses a modified Transformer for multi-step reasoning (MR). Our architecture separates the complexity of multimodal feature fusion from that of inference, which allows a simpler design of the inference engine. IP-based SeqDialN is our baseline, a simple 2-layer LSTM design that achieves decent performance. MR-based SeqDialN, on the other hand, recurrently refines the semantic question/history representations through the self-attention stack of the Transformer and produces promising results on the visual dialog task. On the VisDial v1.0 test-std dataset, our best single generative SeqDialN achieves 62.54% NDCG and 48.63% MRR, and our ensemble generative SeqDialN achieves 63.78% NDCG and 49.98% MRR, setting a new state of the art among generative visual dialog models. Fine-tuning discriminative SeqDialN with dense annotations boosts performance to 72.41% NDCG and 55.11% MRR. We report extensive experiments that demonstrate the effectiveness of our model components. We also visualize how the reasoning process draws on the relevant conversation rounds, and we discuss our fine-tuning methods. Our code is available at https://github.com/xiaoxiaoheimei/SeqDialN
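
The two metrics reported above can be illustrated with a minimal sketch. It is not taken from the SeqDialN code base: the function names are hypothetical, and the NDCG cut-off (k equal to the number of candidates with nonzero dense relevance) is an assumption about the evaluation protocol that should be checked against the official VisDial evaluation before use.

```python
import math


def reciprocal_rank(ranked_ids, gt_id):
    """Return 1 / (1-based rank of the ground-truth answer)."""
    return 1.0 / (ranked_ids.index(gt_id) + 1)


def mean_reciprocal_rank(examples):
    """Average the reciprocal rank over (ranked_ids, gt_id) pairs."""
    values = [reciprocal_rank(ranked, gt) for ranked, gt in examples]
    return sum(values) / len(values)


def dcg(relevances):
    """Discounted cumulative gain with a log2(position + 1) discount."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(ranked_ids, relevance):
    """NDCG for one dialog round.

    relevance maps candidate id -> dense relevance score in [0, 1].
    Assumption: the cut-off k is the number of candidates whose
    relevance is nonzero.
    """
    k = sum(1 for score in relevance.values() if score > 0)
    if k == 0:
        return 0.0
    gains = [relevance.get(cid, 0.0) for cid in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal)


if __name__ == "__main__":
    # Hypothetical toy round with four candidate answers.
    relevance = {"a": 0.0, "b": 1.0, "c": 0.5, "d": 0.0}
    model_ranking = ["a", "b", "c", "d"]
    print("NDCG:", ndcg(model_ranking, relevance))
    print("MRR:", mean_reciprocal_rank([(model_ranking, "b")]))
```

MRR rewards placing the single ground-truth answer near the top, while NDCG credits every candidate that annotators judged relevant. Because the two metrics reward different behavior, fine-tuning on dense annotations can change them by different amounts.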
