Training a robust Speech-to-Text (STT) system requires tens of thousands of hours of data. Variability in the dataset, such as unwanted nuisances (environmental noise, etc.) and biases (accent, gender, age, etc.), is why large datasets are needed to learn general representations, which is often not feasible for low-resource languages. In many computer vision tasks, a recently proposed adversarial forgetting approach to removing unwanted features has produced good results. This motivates us to study the effect of disentangling accent information from the input speech signal while training STT systems. To this end, we use an information bottleneck architecture based on adversarial forgetting. This training scheme aims to force the model to learn general, accent-invariant speech representations. Two STT models, trained on just 20 hours of audio with and without adversarial forgetting, are tested on two unseen accents not present in the training set. The results favour the adversarial forgetting scheme, with an absolute average improvement of 6\% over the standard training scheme. Furthermore, we also observe an absolute improvement of 5.5\% when testing on the seen accents present in the training set.
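As an illustration of the general idea of adversarially suppressing accent information in the encoder, the following is a minimal PyTorch sketch that uses a gradient-reversal accent discriminator alongside a CTC transcription head. This is a common stand-in for adversarial invariance training, not the paper's exact adversarial-forgetting architecture (which uses a forgetting mechanism over an information bottleneck); all module names, layer sizes, and hyperparameters (e.g. n_accents, lambd) are hypothetical.

```python
import torch
import torch.nn as nn


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class AccentInvariantSTT(nn.Module):
    """Encoder with a CTC head for transcription and an adversarial accent head.

    The accent head receives gradient-reversed features, so training the
    discriminator to predict accent pushes the encoder towards
    accent-invariant representations.
    """
    def __init__(self, n_feats=80, hidden=256, n_tokens=29, n_accents=4, lambd=0.5):
        super().__init__()
        self.lambd = lambd
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=3,
                               batch_first=True, bidirectional=True)
        self.ctc_head = nn.Linear(2 * hidden, n_tokens)
        self.accent_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_accents))

    def forward(self, feats):
        h, _ = self.encoder(feats)                      # (B, T, 2*hidden)
        ctc_logits = self.ctc_head(h)                   # transcription branch
        rev = GradientReversal.apply(h.mean(dim=1), self.lambd)
        accent_logits = self.accent_head(rev)           # adversarial branch
        return ctc_logits, accent_logits


# Minimal usage sketch: the total loss combines the CTC objective with the
# adversarial accent-classification loss on hypothetical batch tensors.
model = AccentInvariantSTT()
feats = torch.randn(2, 100, 80)                         # (batch, frames, features)
ctc_logits, accent_logits = model(feats)
accent_labels = torch.tensor([0, 2])
adv_loss = nn.functional.cross_entropy(accent_logits, accent_labels)
```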