This paper presents a new approach to visual zero-shot slot filling. The approach extends previous work by reformulating slot filling as Question Answering: slot tags are converted into rich natural language questions that capture the semantics of the visual information and lexical text on the GUI screen. These questions are paired with the user's utterance, and slots are extracted from the utterance by a state-of-the-art ALBERT-based Question Answering system trained on the Stanford Question Answering dataset (SQuAD 2.0). An approach to further refining the model with multi-task training is also presented; the multi-task approach facilitates the incorporation of a large number of successive refinements and transfer learning across similar tasks. A new Visual Slot dataset and a visual extension of the popular ATIS dataset are introduced to support research and experimentation on visual slot filling. Results show F1 scores between 0.52 and 0.60 on the Visual Slot and ATIS datasets with no training data (zero-shot).
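The slot-filling-as-QA reformulation described above can be sketched minimally as follows. The slot-tag-to-question templates and function names here are illustrative assumptions, not the paper's actual templates; in the paper, question generation also draws on visual and lexical context from the GUI screen.

```python
# Illustrative sketch of reformulating slot filling as extractive QA.
# Templates below are hypothetical examples for standard ATIS slot tags.
SLOT_QUESTIONS = {
    "fromloc.city_name": "Which city does the user want to depart from?",
    "toloc.city_name": "Which city does the user want to fly to?",
    "depart_date.day_name": "On which day does the user want to travel?",
}

def slots_to_questions(utterance, slot_tags):
    """Pair the user's utterance with one natural-language question per slot tag.

    Each (question, context) pair would then be passed to an extractive QA
    model (e.g. an ALBERT model fine-tuned on SQuAD 2.0); the predicted
    answer span is the slot value, and a no-answer prediction (supported by
    SQuAD 2.0-style training) indicates the slot is absent from the
    utterance -- which is what enables the zero-shot setting.
    """
    return [(SLOT_QUESTIONS[tag], utterance) for tag in slot_tags]

pairs = slots_to_questions(
    "book me a flight from Boston to Denver on Monday",
    ["fromloc.city_name", "toloc.city_name"],
)
for question, context in pairs:
    print(question)
```

Because the questions carry the slot semantics in natural language, a pretrained QA model can extract values for slot types it never saw during training.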