Dialogue act classification (DAC) is a critical task for spoken language understanding in dialogue systems. Prosodic features such as energy and pitch have been shown to be useful for DAC. Despite their importance, little research has explored neural approaches for integrating prosodic features into end-to-end (E2E) DAC models, which infer dialogue acts directly from audio signals. In this work, we propose an E2E neural architecture that accounts for the need to characterize prosodic phenomena co-occurring at different levels within an utterance. A novel part of this architecture is a learnable gating mechanism that assesses the importance of prosodic features and selectively retains the core information necessary for E2E DAC. Our proposed model improves DAC accuracy by 1.07% absolute across three publicly available benchmark datasets.
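The learnable gating mechanism mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the gate's form, so everything here is an assumption: a sigmoid gate computed from the prosodic features themselves, applied element-wise, with hypothetical names (`gate_prosody`, `W`, `b`) and random weights standing in for learned parameters.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def gate_prosody(prosody, W, b):
    """Apply an element-wise sigmoid gate to prosodic features.

    Assumed shapes (illustrative, not taken from the paper):
      prosody: (T, D) array, T frames by D features (e.g. energy, pitch)
      W:       (D, D) gate weights, learned during training in a real model
      b:       (D,)   gate bias
    Returns the gated features and the gate values.
    """
    gate = sigmoid(prosody @ W + b)  # one importance score per frame and feature
    return gate * prosody, gate      # scores near 0 suppress, near 1 retain


# Toy input: 5 frames with 2 prosodic features each.
rng = np.random.default_rng(0)
T, D = 5, 2
prosody = rng.normal(size=(T, D))
W = rng.normal(scale=0.1, size=(D, D))  # stand-in for learned weights
b = np.zeros(D)

gated, gate = gate_prosody(prosody, W, b)
print(gated.shape)
```

In a full model, the gated features would presumably be combined with acoustic representations before classification, and `W` and `b` would be trained jointly with the rest of the network; this sketch covers only the gating step.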