In this paper, we propose a novel spoken-text-style conversion method that can simultaneously execute multiple style conversion tasks, such as punctuation restoration and disfluency deletion, without preparing matched datasets. In practice, transcriptions generated by automatic speech recognition systems are often difficult to read because they contain many disfluencies and lack punctuation marks. To improve their readability, multiple spoken-text-style conversion modules, each modeling a single conversion task, are cascaded because matched datasets that simultaneously handle multiple conversion tasks are often unavailable. However, cascading is sensitive to the order of the tasks because conversion errors propagate through the chain. In addition, the computational cost of cascading is inevitably higher than that of a single conversion. To execute multiple conversion tasks simultaneously without preparing matched datasets, our key idea is to distinguish the individual conversion tasks with on-off switches. In our proposed zero-shot joint modeling, the individual tasks are switched on and off with multiple switching tokens, which enables a zero-shot learning approach to executing the conversions simultaneously. Our experiments on joint modeling of disfluency deletion and punctuation restoration demonstrate the effectiveness of our method.
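To make the switching-token idea concrete, the following is a minimal sketch, not the authors' implementation, assuming the switches are realized as control tokens prepended to the input of a sequence-to-sequence model. The token strings, the example sentences, and the build_example helper are hypothetical; the abstract does not specify these details. Each unmatched single-task dataset is tagged with only its own switch turned on during training, and at inference all switches are turned on at once, which is the zero-shot combination.

```python
# Sketch of switching tokens for zero-shot joint spoken-text-style conversion.
# Token strings and helper names are illustrative assumptions, not the paper's spec.

PUNCT_ON, PUNCT_OFF = "<punct:on>", "<punct:off>"
DISFL_ON, DISFL_OFF = "<disfl:on>", "<disfl:off>"

def build_example(source_text, target_text, punct, disfl):
    """Prefix the source with one switching token per task (on or off)."""
    switches = [
        PUNCT_ON if punct else PUNCT_OFF,
        DISFL_ON if disfl else DISFL_OFF,
    ]
    return {"source": " ".join(switches + [source_text]), "target": target_text}

# Training: each single-task dataset activates only its own switch.
punct_example = build_example(
    "well i think we should go", "Well, I think we should go.",
    punct=True, disfl=False)          # punctuation restoration only
disfl_example = build_example(
    "well i think uh we should go", "i think we should go",
    punct=False, disfl=True)          # disfluency deletion only

# Inference (zero-shot combination): both switches are on, even though no
# training example had them on at the same time.
joint_input = " ".join([PUNCT_ON, DISFL_ON, "well i think uh we should go"])
print(joint_input)  # -> "<punct:on> <disfl:on> well i think uh we should go"
```

Under this assumed setup, the same encoder-decoder model sees every single-task example with an explicit on-off prefix, so combining the prefixes at test time asks it to perform both conversions in one pass without ever having seen a matched (jointly labeled) training pair.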