Syllable detection is an important speech analysis task with applications in speech rate estimation, word segmentation, and automatic prosody detection. Based on the well-understood acoustic correlates of speech articulation, it has been realized by local peak picking on a frequency-weighted energy contour that represents vowel sonority. While several of the analysis parameters are set based on known speech signal properties, the selection of the frequency-weighting coefficients and the peak-picking threshold typically involves heuristics, raising the possibility of data-based optimization. In this work, we consider optimizing these parameters by directly minimizing naturally arising task-specific objective functions. The resulting non-convex cost function is minimized using a population-based search algorithm, achieving performance that exceeds previously published results on the same corpus while using a relatively small amount of labeled data. Further, optimizing the system parameters on a different corpus is shown to result in an explainable change in the optimal values.
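As a rough illustration of the detection step described above, the following minimal sketch combines per-frame sub-band energies into a single weighted contour and then picks local peaks above a threshold. The function names (`weighted_contour`, `pick_peaks`), the two-band toy data, and the weight and threshold values are all hypothetical and chosen only for illustration; the actual sub-band layout, smoothing, and peak-picking rules of the system are not specified in this abstract.

```python
from typing import List, Sequence


def weighted_contour(band_energies: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Combine per-frame sub-band energies into one contour.

    Each frame is a sequence of sub-band energies; the contour value is
    the weighted sum over bands (weights are illustrative assumptions).
    """
    return [sum(w * e for w, e in zip(weights, frame))
            for frame in band_energies]


def pick_peaks(contour: Sequence[float], threshold: float) -> List[int]:
    """Return frame indices that are local maxima above the threshold."""
    peaks = []
    for i in range(1, len(contour) - 1):
        rises = contour[i] > contour[i - 1]
        falls = contour[i] >= contour[i + 1]
        if rises and falls and contour[i] > threshold:
            peaks.append(i)
    return peaks


# Toy input: six frames, two hypothetical sub-bands per frame.
frames = [
    [0.1, 0.0],
    [0.8, 0.2],
    [0.2, 0.1],
    [0.1, 0.9],
    [0.6, 0.4],
    [0.1, 0.0],
]
weights = [1.0, 0.5]  # assumed frequency-weighting coefficients

contour = weighted_contour(frames, weights)
peaks = pick_peaks(contour, threshold=0.5)
print(peaks)
```

In this framing, the weights and the threshold are the free parameters that a data-driven search would tune against a task-specific objective, for example a detection error measured against labeled syllable nuclei.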