Text normalization (TN) and inverse text normalization (ITN) are essential preprocessing and postprocessing steps for text-to-speech synthesis and automatic speech recognition, respectively. Many methods have been proposed for either TN or ITN, ranging from weighted finite-state transducers to neural networks. Despite their impressive performance, these methods aim to tackle only one of the two tasks but not both. As a result, in a complete spoken dialog system, two separate models for TN and ITN need to be built. This heterogeneity increases the technical complexity of the system, which in turn increases the cost of maintenance in a production setting. Motivated by this observation, we propose a unified framework for building a single neural duplex system that can simultaneously handle TN and ITN. Combined with a simple but effective data augmentation method, our systems achieve state-of-the-art results on the Google TN dataset for English and Russian. They can also reach over 95% sentence-level accuracy on an internal English TN dataset without any additional fine-tuning. In addition, we create a cleaned dataset from the Spoken Wikipedia Corpora for German and report the performance of our systems on it. Overall, experimental results demonstrate that the proposed duplex text normalization framework is highly effective and applicable to a range of domains and languages.
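To make the TN/ITN distinction concrete, the following is a minimal, purely illustrative rule-based sketch. The rule table and function names here are hypothetical; the systems described in the abstract are neural models, not hand-written rules.

```python
# Toy illustration of text normalization (TN) and inverse text
# normalization (ITN). TN maps written forms to spoken forms for TTS;
# ITN maps spoken forms back to written forms for ASR output.
# These few hand-written rules are hypothetical examples only.

# written form -> spoken form (TN direction)
TN_RULES = {
    "123": "one hundred twenty three",
    "$5": "five dollars",
    "3:30": "three thirty",
}

# spoken form -> written form (ITN direction): the reverse mapping
ITN_RULES = {spoken: written for written, spoken in TN_RULES.items()}


def normalize(text: str) -> str:
    """TN: replace written tokens with spoken forms (identity fallback)."""
    return " ".join(TN_RULES.get(tok, tok) for tok in text.split())


def inverse_normalize(phrase: str) -> str:
    """ITN: map a spoken phrase back to written form if a rule matches."""
    return ITN_RULES.get(phrase, phrase)
```

A duplex system, by contrast, handles both directions with a single model rather than two separate rule sets or networks.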