Reference resolution, which aims to identify the entities a speaker is referring to, is more complex in real-world settings: new referents may be created by processes the agents engage in, or be salient only because they belong to the shared physical setting. Our focus is on resolving references to visualizations on a large screen display in multimodal dialogue; crucially, reference resolution is directly involved in the process of creating new visualizations. We describe our annotations of user references to visualizations appearing on a large screen, made via language and hand gesture, and of new entity establishment, which results from executing a user request to create a new visualization. We also describe our reference resolution pipeline, which relies on an information-state architecture to maintain dialogue context. We report results on detecting and resolving references, on the effectiveness of contextual information for the model, and on under-specified requests for creating visualizations. We also compare a conventional CRF with deep-learning/transformer models (BiLSTM-CRF and BERT-CRF) for tagging references in user utterance text. Our results show that transfer learning significantly boosts the performance of the deep-learning methods, although the CRF still outperforms them, suggesting that conventional methods may generalize better for low-resource data.
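To make the tagging formulation concrete, below is a minimal sketch of BIO-style reference tagging with a conventional CRF. This is not the paper's implementation: the example utterance, the feature set, and the `B-REF`/`I-REF` label names are invented for illustration, and the sketch uses the third-party sklearn-crfsuite library rather than whatever toolkit the authors used.

```python
# Minimal sketch of BIO-style reference tagging with a conventional CRF.
# Not the paper's implementation: the utterance, features, and labels are
# invented for illustration. Requires: pip install sklearn-crfsuite
import sklearn_crfsuite

def token_features(tokens, i):
    """Simple lexical/contextual features for token i, as a CRF expects."""
    return {
        "word.lower": tokens[i].lower(),
        "is_demonstrative": tokens[i].lower() in {"this", "that", "these", "those"},
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# One toy training utterance; BIO labels mark a reference to a visualization.
utterance = ["Move", "that", "bar",   "chart", "to", "the", "left"]
labels    = ["O",    "B-REF", "I-REF", "I-REF", "O",  "O",   "O"]

X_train = [[token_features(utterance, i) for i in range(len(utterance))]]
y_train = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))  # [['O', 'B-REF', 'I-REF', 'I-REF', 'O', 'O', 'O']]
```

In a real pipeline the predicted `B-REF`/`I-REF` spans would then be passed to the resolver, which matches each span against candidate visualizations in the dialogue context.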