Most research on task-oriented dialog modeling is based on written text input. However, users often interact with practical dialog systems using speech as input. Typically, these systems convert speech into text using an Automatic Speech Recognition (ASR) system, which introduces errors. Furthermore, they do not address the differences between written and spoken language. Research on this topic is stymied by the lack of a public corpus. Motivated by these considerations, our goal in hosting the speech-aware dialog state tracking challenge was to create a public corpus or task that can be used to investigate the performance gap between the written and spoken forms of input, develop models that could alleviate this gap, and establish whether Text-to-Speech-based (TTS) systems are a reasonable surrogate for the more labor-intensive human data collection. We created three spoken versions of the popular written-domain MultiWoz task: (a) TTS-Verbatim, in which written user inputs were converted into speech waveforms using a TTS system; (b) Human-Verbatim, in which humans spoke the user inputs verbatim; and (c) Human-paraphrased, in which humans paraphrased the user inputs. Additionally, we provided different forms of ASR output to encourage wider participation from teams that may not have access to state-of-the-art ASR systems. These included ASR transcripts, word time stamps, and latent representations of the audio (audio encoder outputs). In this paper, we describe the corpus, report results from participating teams, provide preliminary analyses of their results, and summarize the current state of the art in this domain.