In a hybrid speech model, voiced and unvoiced components can coexist within a segment. The voiced speech is often regarded as the deterministic component, and the unvoiced speech and additive noise as the stochastic components. Speech is typically considered stationary within fixed segments of 20-40 ms, but the degree of stationarity varies over time. For decomposing noisy speech into its voiced and unvoiced components, a fixed segmentation may be too crude, and we propose to adapt the segment length to the local characteristics of the signal. The segmentation relies on parameter estimates of a hybrid speech model, with the maximum a posteriori (MAP) and log-likelihood criteria serving as model selection rules among the candidate segment lengths for voiced and unvoiced speech, respectively. Given the optimal segmentation markers and the estimated statistics, both components are estimated using linear filtering. A codebook-based approach differentiates between unvoiced speech and noise. Taking the adaptive segmentation into account allows a better extraction of the components than a fixed segmentation. Compared to other decomposition methods, the approach can also achieve lower distortion for voiced speech and a higher segmental SNR (segSNR) for both components.
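The abstract reports segSNR without defining it. As a point of reference, here is a minimal sketch of one common formulation of segmental SNR: the mean of per-frame SNR values in dB, with each frame clamped to a fixed range. The function name `seg_snr`, the 160-sample frame (20 ms at an assumed 8 kHz sampling rate), and the clamping limits of -10 dB and 35 dB are illustrative assumptions, not values taken from the paper, and the authors' exact definition may differ.

```python
import math


def seg_snr(reference, estimate, frame_len=160, floor_db=-10.0, ceil_db=35.0):
    """Mean per-frame SNR in dB between a reference signal and its estimate.

    frame_len, floor_db and ceil_db are assumed defaults, not paper values.
    Trailing samples that do not fill a whole frame are ignored.
    """
    if len(reference) != len(estimate):
        raise ValueError("signals must have equal length")
    eps = 1e-12  # guards against division by zero and log of zero
    frame_snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        est = estimate[start:start + frame_len]
        signal_energy = sum(r * r for r in ref)
        error_energy = sum((r - e) ** 2 for r, e in zip(ref, est))
        snr_db = 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))
        # Clamp each frame so silent or perfectly matched frames do not dominate.
        frame_snrs.append(min(max(snr_db, floor_db), ceil_db))
    if not frame_snrs:
        raise ValueError("signal is shorter than one frame")
    return sum(frame_snrs) / len(frame_snrs)
```

In a decomposition experiment, `reference` would be the clean voiced (or unvoiced) component and `estimate` the component extracted from the noisy signal, so a higher value indicates a closer match frame by frame.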