In spoken question answering, systems are designed to answer questions from contiguous text spans within the related speech transcripts. However, the most natural way for humans to seek or test their knowledge is through conversation. Therefore, we propose a new task, Spoken Conversational Question Answering (SCQA), which aims to enable systems to model complex dialogue flows given speech documents. In this task, our main objective is to build a system that handles conversational questions based on audio recordings, and to explore the plausibility of providing systems with additional cues from different modalities during information gathering. To this end, instead of directly adopting automatically generated speech transcripts, which are highly noisy, we propose a novel unified data distillation approach, DDNet, which effectively ingests cross-modal information to achieve fine-grained representations of the speech and language modalities. Moreover, we propose a simple and novel mechanism, termed Dual Attention, which encourages better alignment between audio and text to ease the process of knowledge transfer. To evaluate the capacity of SCQA systems in dialogue-style interaction, we assemble a Spoken Conversational Question Answering (Spoken-CoQA) dataset with more than 40k question-answer pairs from 4k conversations. The performance of existing state-of-the-art methods degrades significantly on our dataset, demonstrating the necessity of cross-modal information integration. Our experimental results demonstrate that the proposed method achieves superior performance on spoken conversational question answering tasks.
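The abstract describes Dual Attention only as a mechanism that encourages alignment between audio and text; it does not say how the alignment is computed. As a rough illustration of what such cross-modal alignment can look like, the sketch below uses generic two-way scaled dot-product attention on toy vectors. It is not DDNet's actual mechanism: the function names (`cross_attend`, `bidirectional_cross_attention`), the use of keys as values, and the toy feature vectors are all assumptions made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys):
    """Scaled dot-product attention in which `keys` also serve as values.

    queries: list of n vectors of length d
    keys:    list of m vectors of length d
    Returns n vectors, each a convex combination of the key vectors.
    """
    d = len(queries[0])
    scale = math.sqrt(d)
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of the key vectors, one dimension at a time.
        attended.append(
            [sum(w * k[j] for w, k in zip(weights, keys)) for j in range(d)]
        )
    return attended


def bidirectional_cross_attention(audio, text):
    """Hypothetical two-way alignment: text attends to audio and vice versa."""
    text_with_audio = cross_attend(text, audio)
    audio_with_text = cross_attend(audio, text)
    return text_with_audio, audio_with_text


# Toy inputs (assumed): three audio frames and two text tokens, d = 2.
audio_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.0, 1.0]]

text_ctx, audio_ctx = bidirectional_cross_attention(audio_frames, text_tokens)
print(len(text_ctx), len(audio_ctx))
```

Each text token receives an audio-informed vector and each audio frame receives a text-informed vector, so both sequences keep their original lengths. A real system would add learned projections for queries, keys, and values, which are omitted here for brevity.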