Building Spoken Language Understanding (SLU) systems that do not rely on language-specific Automatic Speech Recognition (ASR) is an important yet underexplored problem in language processing. In this paper, we present a comparative study aimed at employing a pre-trained acoustic model to perform SLU in low-resource scenarios. Specifically, we use three different embeddings extracted using Allosaurus, a pre-trained universal phone decoder: (1) Phone, (2) Panphone, and (3) Allo embeddings. These embeddings are then used to identify the spoken intent. We perform experiments across three different languages: English, Sinhala, and Tamil, each with a different data size, to simulate high-, medium-, and low-resource scenarios. Our system improves on the state-of-the-art (SOTA) intent classification accuracy by approximately 2.11% for Sinhala and 7.00% for Tamil, and achieves competitive results on English. Furthermore, we present a quantitative analysis of how performance scales with the number of training examples used per intent.
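To make the pipeline concrete, the sketch below shows one plausible way to wire a pre-trained universal phone decoder into an intent classifier. It uses the public `allosaurus` Python package to decode each utterance into a language-independent phone string and a simple bag-of-phone-n-grams classifier from scikit-learn; the file names, training pairs, and classifier choice are illustrative assumptions, and the bag-of-n-grams features stand in for the Phone/Panphone/Allo embeddings studied in the paper, not the authors' exact setup.

```python
# Minimal sketch: phone-based intent classification without
# language-specific ASR, assuming `pip install allosaurus scikit-learn`.
from allosaurus.app import read_recognizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Load the pre-trained universal phone recognizer
# (downloads the default Allosaurus model on first use).
recognizer = read_recognizer()

def utterance_to_phones(wav_path: str) -> str:
    # Decode a waveform into a phone string (e.g. "h ə l oʊ");
    # no language-specific acoustic or language model is involved.
    return recognizer.recognize(wav_path)

# Hypothetical training data: (audio path, intent label) pairs.
train = [
    ("turn_on_light_01.wav", "lights_on"),
    ("play_music_01.wav", "play_music"),
]

X = [utterance_to_phones(path) for path, _ in train]
y = [label for _, label in train]

# Character n-grams over phone strings approximate a phone embedding;
# the paper's learned embeddings would replace this featurizer.
clf = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
clf.fit(X, y)

# Classify a new utterance by decoding it to phones first.
print(clf.predict([utterance_to_phones("new_utterance.wav")]))
```

Because the recognizer emits the same universal phone inventory for every language, the same classifier recipe applies unchanged to English, Sinhala, or Tamil; only the labeled (audio, intent) pairs differ, which is what makes the low-resource comparison in the paper possible.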