In general, speech processing models consist of a language model and an acoustic model. Regardless of the language model's complexity or variant, it requires three critical pre-processing steps: cleaning, normalization, and tokenization. Among these steps, normalization is essential for format unification in purely textual applications. For language models embedded in speech processing modules, however, normalization is not limited to format unification: it must also convert each readable symbol, number, and so on into the way it is pronounced. To the best of our knowledge, there is no Persian normalization toolkit for language models embedded in speech processing modules. In this paper, we therefore propose an open-source normalization toolkit for text processing in speech applications. Briefly, we consider different kinds of readable Persian text, such as symbols (common currencies, #, @, URLs, etc.) and numbers (dates, times, phone numbers, national codes, etc.). A comparison with other available Persian text normalization tools indicates the superiority of the proposed method for speech processing. In addition, comparing one of the proposed functions (sentence separation) with common natural language libraries such as HAZM and Parsivar indicates that the proposed method performs well. Its evaluation on a sample of Persian Wikipedia data further confirms this performance.
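To make the distinction between format unification and pronunciation-oriented normalization concrete, the following is a minimal sketch in Python. It is not the proposed toolkit's API: the function names `number_to_words` and `normalize_for_speech`, the symbol table, and the restriction to numbers from 0 to 99 are all assumptions made for illustration only.

```python
import re

# Hypothetical lookup tables for illustration; the toolkit's real tables
# are not given in the abstract and may differ.
ONES = ["صفر", "یک", "دو", "سه", "چهار", "پنج", "شش", "هفت", "هشت", "نه"]
TEENS = ["ده", "یازده", "دوازده", "سیزده", "چهارده",
         "پانزده", "شانزده", "هفده", "هجده", "نوزده"]
TENS = ["", "", "بیست", "سی", "چهل", "پنجاه", "شصت", "هفتاد", "هشتاد", "نود"]
SYMBOLS = {"%": "درصد", "٪": "درصد", "$": "دلار", "€": "یورو"}


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..99 in Persian words."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0..99")
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    if ones == 0:
        return TENS[tens]
    # Persian joins tens and ones with the conjunction "و" ("and").
    return f"{TENS[tens]} و {ONES[ones]}"


def normalize_for_speech(text: str) -> str:
    """Rewrite symbols and small numbers as they would be read aloud."""
    # Replace each known symbol with its spoken form, padded with spaces.
    for symbol, spoken in SYMBOLS.items():
        text = text.replace(symbol, f" {spoken} ")

    def spell(match: re.Match) -> str:
        # int() accepts both ASCII and Persian digits, so this step also
        # performs the digit unification a text-only normalizer would do.
        value = int(match.group())
        # Numbers outside the sketch's range are left untouched.
        return number_to_words(value) if value <= 99 else match.group()

    # For str patterns, \d matches any Unicode decimal digit.
    text = re.sub(r"\d+", spell, text)
    # Collapse the extra whitespace introduced above.
    return " ".join(text.split())
```

For example, `normalize_for_speech("۲۵%")` rewrites the Persian digits and the percent sign into the words a speaker would say. A text-only normalizer would typically stop at unifying the digit and symbol encodings. Dates, times, phone numbers, and URLs each need their own reading rules, which this sketch does not attempt.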