In a multimodal assistant, where vision is also one of the input modalities, identifying user intent becomes a challenging task because the visual input can influence the outcome. Current digital assistants take spoken input and try to determine the user intent from conversational or device context. Consequently, a dataset that includes visual input (i.e., images or videos) for questions targeted at multimodal assistant use cases is not readily available. Research in visual question answering (VQA) and visual question generation (VQG) is a great step forward. However, these datasets do not capture the questions that a sighted person would ask a multimodal assistant. Moreover, their questions often do not seek information from external knowledge. In this paper, we provide a new dataset, MMIU (MultiModal Intent Understanding), that contains questions and corresponding intents provided by human annotators while looking at images. We then use this dataset for the intent classification task in a multimodal digital assistant. We also experiment with various approaches to combining vision and language features, including the use of a multimodal transformer, to classify image-question pairs into 14 intents. We provide benchmark results and discuss the role of visual and text features in the intent classification task on our dataset.
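To make the feature-combination setup concrete, below is a minimal sketch of one simple fusion baseline: projecting precomputed image and question features into a shared space, concatenating them, and classifying into 14 intents. The class name, feature dimensions, and architecture details are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal late-fusion sketch, assuming precomputed image features (e.g. from a
# CNN backbone) and text features (e.g. from a sentence encoder).
# Dimensions and names are hypothetical.
import torch
import torch.nn as nn

NUM_INTENTS = 14  # number of intent classes in the MMIU classification task


class FusionIntentClassifier(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hidden_dim=512,
                 num_intents=NUM_INTENTS):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.txt_proj = nn.Linear(txt_dim, hidden_dim)
        # Classify the concatenated (fused) representation.
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(2 * hidden_dim, num_intents),
        )

    def forward(self, img_feat, txt_feat):
        fused = torch.cat(
            [self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1
        )
        return self.classifier(fused)


# Example: a batch of 4 image-question pairs with random placeholder features.
model = FusionIntentClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 14])
```

A multimodal transformer, as mentioned in the abstract, would instead fuse the two modalities with cross-attention over token- and region-level features rather than a single concatenation, but the classification head over 14 intents remains the same idea.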