Conventional conversational assistants extract text transcripts from the speech signal using automatic speech recognition (ASR) and then predict intent from the transcriptions. In end-to-end spoken language understanding (SLU), the speaker's intent is predicted directly from the speech signal without requiring an intermediate text transcript. As a result, the model can optimize directly for intent classification and avoid the cascading errors introduced by ASR. An end-to-end SLU system also reduces the latency of intent prediction. Although many datasets are publicly available for text-to-intent tasks, labeled speech-to-intent datasets are scarce, and none are available for Indian-accented speech. In this paper, we release the Skit-S2I dataset, the first publicly available Indian-accented SLU dataset, covering the banking domain in a conversational tonality. We experiment with multiple baselines, compare the representations of different pretrained speech encoders, and find that SSL-pretrained representations perform slightly better for speech-to-intent classification than ASR-pretrained representations, which lack prosodic features. The dataset and baseline code are available at \url{https://github.com/skit-ai/speech-to-intent-dataset}